From the Editor-in-Chief



One year later 


Now that I’ve been IEEE Micro's EIC for one year, it
is time to check the balance sheets; in my introduction of
February 1991 I expressed some of my “good intentions.”
The last two words in the first paragraph in that
introduction (I'm sure you keep old issues, so you will
know what they were) were surely true.

In this year some things have been accomplished,
while others still must be done. I am sure that IEEE Micro has been a good
channel to bring a lot of useful information to
you readers. I myself learned a lot of things. First
of all I learned how important the work of the
managing editor and staff is to the magazine. I
understand now that a magazine could survive
without an EIC but not without the staff.

I also learned that Europeans can directly communicate
with people in the US only in a narrow
time window (and that everybody in the US
makes use of an answering machine!). Some big
organization should start a project to unify the
time on a world basis. (It would be quite easy:
You modify the sun into a ring-shaped variable
star with the earth in the center.... Sometimes I
think that such a project could be simpler than
trying to make a magazine exactly as we think it
should be.)

Some members of the editorial board completed
their terms—we must thank them for their
contributions—and others joined the board. Our
editorial board is growing with capable people
from academia and industry, while it keeps the
international balance between the US, Europe,
and the Far East.

A magazine results from the combined efforts 
of authors, editors, staff, and readers. The first 
three groups of people depend on the last— 
you, the readers—to continue to support the 
magazine by providing proposals, comments, 
and suggestions. As in the past, we promise to 
always consider these comments. 

What changed in the magazine during 1991? 
Costs were reduced (credit to our staff), allowing
us to bring you the most pages possible in
spite of inflation and general economy problems. 

We have also devoted efforts to reducing the 
amount of time spent in reviewing manuscripts. 
The shorter review cycle serves the needs of
authors (their work reaches a wide audience in
a shorter time, or at least they know their fate
sooner) and readers (they receive updated information
quickly). The theme issues in 1991
brought coordinated sets of articles on some of 
the hottest areas in microelectronics and 
microsystems. This will continue in 1992 and 
1993. (See p. 41 for clues.) 

The current issue is a “general” one, so you
find an assortment of articles that address various
aspects of high-performance computing. The
first, “The Scalable Coherent Interface and Related
Standards Projects” by Dave Gustavson,
provides an insight into an IEEE project (P1596-SCI).
SCI proposes new solutions for the communication
bottleneck intrinsic in bus-based
multiprocessor architectures. Instead of wider and
faster buses (which also means more expensive
and power-hungry buses), we can switch to
nonbused interconnection schemes and still keep
the configurability we generally find in bused
systems. SCI is actually a set of projects addressing
the many aspects of an interconnection structure,
from the lowest level (the physical
communication channel) to higher protocol
layers dealing with cache coherency.
It ensures tight coordination with
other standards projects.

Next, Daniel Mann’s “Unix and the
Am29000 Microprocessor” article shows 
how the hardware features of a RISC 
processor can be exploited to optimize 
its performance for a given operating 
system. 

With the third article we switch to 
completely different architectures. 
“Hardware Requirements for Neural 
Network Pattern Classifiers: A Case 
Study and Implementation” by Boser 
et al. describes in detail an ASINC, or 
application-specific integrated neural 
circuit. This is so new an acronym that 
it does not yet appear in Carl Warren’s 
glossary (see Micro Standards beginning
on p. 69).

Efficient and affordable handwritten
character recognition can open the way
to a variety of new industrial applications.
I am convinced that we will see
such features within the sensor itself
in a few years (the same way keyboards
now have embedded controllers that
send ASCII characters directly to the
host microprocessor). There is a good
chance that such intelligent sensors will
use the neural approach, so take a close
look at this article to get an idea of
what will be on the market in the near
future.

A common denominator for these 
three articles is that each of them 
achieves high performance, thanks to 
some specific silicon basis (protocol 
chips, RISC, neural hardware). The next 
article, “Experimentation with Hypercube
Database Engines” by Frieder et
al., describes efficient algorithms for 
handling distributed shared data in a 
specific architecture. 

Even the best hardware and the best
software can provide good results only
if they can work correctly. In a modular
system, this means each module
must comply with the specification of
the common interface (the bus). The
last article in this issue, “Conformance
Testing of VMEbus and Multibus II
Products” by Adams et al., describes a
technique used to verify compliance
of boards to a given bus specification.






In the mailbag 


(LK: liked; DLK: disliked; LTS: like to see)

In this mailbag the total number of
readers who did not like split articles
amounts to nine. Their comments are
not individually reported here; I think
they already received their answer in
December. Thank you all for the advice.—D.D.C.

February 1991 

LK: Hardware design; LTS: network 
systems.—S.T., Moscow 

April 1991 

LK: Optical architecture: excellent
article, though I think there were
some errors in certain figures and not
enough clarity in parts of the text.
(Please be more precise, so we can
inform the author and ask for corrections
if required.—D.D.C.) LTS: RISC
architectures.—J.C.A.A., Tlalnepantla,
Mexico

LK: Software and hardware computer
science and its electronics; LTS:
... U.P.S. units for computers.— 
M.M.M., Alexandria, Egypt 

LK: Letters, Micro News, Micro
View, and other chapters; LTS: a chapter
[identifying] computer idioms.—M.G.,
Isfahan, Iran (An acronym
dictionary appears in this issue.—D.D.C.)

June 1991 

LK: Richard Stern, Book Review—T.S.,
Tromso, Norway

LK: On The Edge; it has focused
very well on the problem. Moreover,
author used simple language to describe
it... very important... to teach
something. I’d like to see more articles
like this.—C.G., Verona, Italy
(These compliments are for Carl Warren
and James Gafford, who were
responsible for this column.—D.D.C.)

LK: The whole issue; DLK: “Light
at the end of the chunnel.” It has
nothing to do with the hot chips
[theme]. (You are correct: Departments
are not necessarily linked with
special themes.—D.D.C.) LTS: Hot
Chips III—T.P., Warsaw (You will get
it in the April issue.—D.D.C.)

LK: The overview of the BTRON/286
specs.—E.D., Aalen, Germany

LK: Enjoyed reading [about] iWarp
and Datawave. How much does the
iWarp and ITT Datawave chip cost?
(The task of the designer—who writes
the articles—is to minimize the manufacturing
cost, but they usually have
little control on final pricing. It is better
to ask a commercial representative.—D.D.C.)
DLK: the guest editors’
introduction of what is superscalar
and pipelining; LTS: How superscalar
and pipelining come into play. Do
all chips have one or could they have
both? This was not clear.—A.R.F., San
Ramon, CA (Almost all microprocessors
announced in the last five years
are pipelined; only a few recent ones
are also superscalar.—M.D.H./D.A.W.,
guest editors.)

DLK: Rate monotonic scheduling, 
too little tutorial information.—P.F., 
Adliswil, Switzerland 

LK: iWarp: A 100 MOPS....—S.R.E, 
Saudi Arabia 

August 1991 

LK: Micro Law; I never miss it.— 
T.D.L., Cupertino, CA 

LTS: More practical software reports 
for science and engineering, Unix.— 
F.M.R., Munich 
Micro Law


Richard H. Stern 

Oblon, Spivak, McClelland,
Maier & Neustadt, P.C.
1755 Jefferson Davis Highway
Suite 400
Arlington, VA 22202


Engineers can be disqualified, too. 


During the heyday of the mergers-and-acquisitions
frenzy, the disqualification
ploy became a common litigation tactic.
The first thing a party would do in a lawsuit
would be to file a motion that the other side’s
lawyer should be disqualified from representing
his or her client, because at some time or another
counsel had performed some kind of legal
representation of the adversary party who made
the motion.

The rationale for filing such motions, and for
the court when it went along with the move,
was to prevent the appearance of impropriety.
Never mind that the earlier legal representation
had nothing to do with the present case. The
public’s great confidence in the legal profession
might be impaired if it even appeared that clients
might not be able to rely on the confidentiality
of their relationship with their attorneys. How
could clients feel safe hiring lawyers and telling
them their secrets if the very same secrets would
later be used against them to help an adversary?

After a while, the courts became as cynical
about these motions as those who filed them
and recognized them as another ploy to put the
other party at a tactical disadvantage. They became
skeptical about arguments based on the
body blow to our social order wrought by the
appearance of impropriety and began to limit
disqualification to situations where a party’s adversary
would actually get the advantage of relevant
business secrets disclosed in confidence
in an earlier case. Such things as knowing the
structure of the client’s hierarchy of values, the
client’s aversion to or enthusiasm for risk, how
the client thinks about things in general, or the
client’s negotiating strategies, all went out the
window as bases for disqualification.

The rule now seems to be pretty close to this:
The lawyer must really know (or be likely to
know) where the relevant bodies are buried. For
example, the court may disqualify counsel if the
lawyer

• wrote the patent application on the invention
that the lawyer now wants to assert
(on behalf of an adversary of the inventor)
is invalid;

• is going to be a witness, because he or she
wrote the patent application that supposedly
was used to defraud the patent office;

• represented the patent owner in another
case involving a related patent, where counsel
learned about the weaknesses of the
client’s claim on this technology; or

• represented one party to a license in working
out and drafting the agreement, where
parties are now in a dispute over what the
agreement was intended to do (what it
means), and the lawyer wants to represent
the other party in the dispute (or maybe
now the lawyer plans to claim that the original
license agreement is legally invalid).

While things are pretty well sorted out on that 
front, a new frontier is opening up, which may 
be of greater concern to electrical engineers and 
computer science professionals. It now appears 
that they, too, can be disqualified. 

The Balde case. In a decision handed down
last summer, Wang Laboratories, Inc. v. Toshiba
Corp., a federal court in Alexandria, Virginia,
(see box) disqualified a computer consultant from
representing NEC (a codefendant of Toshiba)
because he had previously given Wang a preliminary
opinion that its patent was invalid. In
November 1990, Wang’s attorney had telephoned
John Balde, a computer consultant. He asked
Balde if he was familiar with SIMM
(single in-line memory module) technology,
and Balde said he was. The
parties differ over what happened next.
According to Wang, the lawyer retained
Balde and agreed to pay him for his
time. According to Balde, he was not
retained, and he told the lawyer he
would have to determine whether
Wang’s SIMM patents were valid before
he would enter any agreement
with the company.

Wang’s lawyer then sent Balde a letter
(dated November 14) transmitting
various documents: the SIMM patents,
some prior art publications, some materials
concerning patent infringement,
and a long memorandum written by
Wang’s attorney about the history of
the prosecution of the patents before
the patent office. He asked Balde to
review the material, “so that we can
discuss how best to explain the advantages
to a computer designer of using”
SIMM, and he suggested they meet after
Balde reviewed the material.

The next day—watch out for this
one, readers—Wang’s lawyer sent a
second letter. Unlike the first one, it
was prominently labeled, “Confidential
Attorney-Work Product.” This letter
(dated November 15) contained an
outline of potential legal defenses
against Wang’s suit, as Wang’s lawyer
perceived them, and asked Balde to
provide his opinion on various issues.
It also said that this material “will assist
your review of the material included
in my November 14, 1990, letter.”

Several telephone conferences followed.
The lawyer later said that in
them he disclosed Wang’s confidential
information to Balde, and that he made
the confidentiality clear to Balde. Balde
did not deny this, but said that he made
no use of the material in the November 15
letter. He considered the letter
premature because it asked him to give
opinions on specific litigation issues.
But he did not want to give any opinions
on the issues until he had made
up his mind about the validity of the
SIMM patents. If they were not valid,
he did not want to become involved
with Wang. Balde felt that Wang’s lawyer
was jumping the gun on enlisting
Balde in Wang’s camp before he felt
ready to sign up. Nevertheless, as the
judge pointed out in his opinion, Balde
did not write back to Wang’s lawyer
saying any of this.


Alexandria's "rocket docket"

Last summer, the US District
Court for the Eastern District of
Virginia awarded Wang $3.3 million
in damages against Toshiba
and NEC for patent infringement.
The case is of special interest because
it comes from the “rocket
docket” of Alexandria, Virginia.
This court is becoming a forum of
choice for patent infringement litigation
and other complicated
cases that usually drag on for
many years, because of its firm
policy that all cases must be tried
within six months after joining issue.
Presumably, one can expect
the same kind of rulings about
experts in other cases brought in
that court.


Balde studied the matter and concluded
that the SIMM patents were invalid.
On December 10 he telephoned
Wang’s lawyer, told him his conclusion,
and said that he therefore did not
want to act as a consultant for Wang in
the case. The lawyer asked Balde for a
report, which Balde sent two days later.
In his cover letter for the report, Balde
wrote, “As you know, I have read ...
the Work-Product information on the
two Wang SIMM patents,” and he
thanked Wang for offering to pay his
$1,500 invoice for time spent on the
task. Subsequently, NEC retained Balde
as its technical expert in the case. Wang
then moved to disqualify Balde.

Court’s decision. The court found
little precedent about disqualification
of experts, but held that it had an inherent
power to disqualify them in appropriate
cases. This power exists to
help the court fulfill its judicial duty to
protect the integrity of the adversary
legal process and promote public confidence
in the fairness and integrity of
the legal process.

The court found two issues: 1) Was
it reasonable for Wang to think it had
a confidential relationship with Balde?
2) Did Wang disclose confidential information
to Balde? If and only if both
answers are affirmative, Balde should
be disqualified. It also recognized that
lawyers might seek to disable potentially
troublesome experts merely by
retaining them without using them, and
said that it would not countenance that
ploy. (Here, Balde had been NEC’s
consultant in the past.) On the other
hand, consultants who do not want to
be bound by a duty of confidentiality
should make it clear that they do not
want to be retained until they make
up their minds, and in the meantime
they should not accept confidential
disclosures.

In this case, the November 15 letter
carried the day for Wang. The letter
was captioned “confidential attorney-work
product,” and an examination of
the enclosures confirms that description.
The court said, “No experienced
litigator would freely disclose these materials
to opposing counsel.” Balde’s silence
in the face of the November 15
letter established that Wang’s lawyer
was reasonable in thinking that a confidential
relationship existed, and in fact
that letter transmitted confidential information
to Balde.

Advice. Neither Wang nor Balde had
“acted inappropriately,” the court
found. Even so, it decided to offer some
free advice for others “to avoid a repetition
of these unfortunate circumstances.”
First, lawyers should make it
clear and confirm in writing that a confidential
relationship will be created.
Preferably, this should be specifically
explained in a letter, along with confirmation
of payment terms and conditions.
These elements were missing in
this case. Nevertheless, the court disqualified
Balde.

Next, consultants should take care
to avoid creating confusion about their
position. Doubts about wanting to be
retained should be unequivocally expressed.
Here, the court said, “given
his stated [read that as ‘alleged’] concerns”
about patent invalidity, Balde
needed no more than the identity of
the parties and the patent numbers.
(The last is a bit of judicial overkill.
Since the patents are public documents,
their text is no secret. It would be an
inconvenience to make Balde get his
own copies.) He should have declined
to accept anything more, the court said.

Counsel should ask the consultant 
whether his past employment creates 
any problem. Since NEC apparently did 
not make this inquiry of Balde before 
retaining him for the case against Wang, 
the court felt that NEC had only itself 
to thank for its problem of having no 
expert on the eve of trial. (Presumably, 
Wang, too, should have asked Balde 
whether his past work for NEC gave 
him any of NEC’s secrets, which could 
make it improper for him to be Wang’s 
consultant.) 

Also, NEC should have promptly
advised Wang and discussed the matter
“thoroughly in an effort to resolve
the dispute before it is raised in court.”
(To which I say, lots of luck. Why
would Wang agree to roll over on anything?
Excess of gentlemanliness? When
was the last time you heard of noblesse
oblige being a significant factor in how
litigators behave?)

To be sure, the court said, experts 
“are not advocates; they are sources of 
information and opinions in technical 
[matters] ... Yet, when experts are retained
in connection with litigation,
they must operate within the constraints 
of, and consistent with, the adversary 
process.” 

I am not going to express any opinion
on whether Wang’s lawyer sandbagged,
hoodwinked, or otherwise
mistreated Balde and NEC, nor on
whether Balde got a deserved comeuppance.
The court apparently felt that
none of that happened, which is, of
course, quite good enough for me. But
the lawyer certainly outmaneuvered the
EE in this case.

Protecting yourself. So where does
all of this leave other EEs? What do
you do to “avoid a repetition of these
unfortunate circumstances”? Or, exercising
20-20 hindsight, what should
Balde have done? What would the CYA
Manual for Electronic Engineer and
Computer Science Would-be Expert
Witnesses have prescribed, if anything?
Remember, Balde was not trying to
scare Wang away. He didn’t know
whether NEC would want to retain him
in this case, and he had to pay the rent
every month. He also did not know,
presumably, how he felt about the
SIMM patents and Wang’s case.


Ideally, Balde would have had his
own lawyer, whom he would have
consulted immediately after the November 14
telephone call. He should
have turned the November 14 and 15
letters over to a lawyer, unread and
preferably unopened. Balde’s lawyer
would have responded for him or advised
him how to respond. He would
have matched Wang’s lawyer’s efforts
with the same thing in reverse, something
like the following:

E, U & D, Counselors-at-Law
November 16, 1990

Dear Sir:

My client, Double-E Balde, has
turned over to me your letters of November 14
and 15 and enclosures,
which he has not read. My client wishes
you to be advised that he is not yet fully
prepared, at this time, to enter into a
relationship of confidentiality with regard
to the subject matter of Wang v.
Toshiba and NEC. In all fairness to
Wang, my client feels that he should first
make his own preliminary evaluation
of materials already of public record,
or otherwise not secret or confidential,
in order to satisfy himself that he can
appropriately consult with, or act as an
expert for, Wang in this matter. I am
sure you will appreciate that it would
not be in Wang's or his interests to have
him prematurely enter into a confidential
relationship with Wang, in the event
that it later appears he would be obliged
to testify as to opinions inconsistent with
Wang's position in this matter.

We will get back to you as soon as
Mr. Balde has finished the foregoing
preliminary evaluation. In the interim,
I will retain the documents, unless you
wish me to return them to you at this
time.

Sincerely yours,

E, U & D

Of course, you may feel it is not feasible
to run a consulting business this
way. What then? You consultants who
don’t want to run your business
through lawyers still must do something,
or you could be “Balde-ized.” At
the very least, you should send a letter,
such as the following, that negatives
a confidential client relationship
like the one set up for Balde by the
lawyer’s letter.

Bit Bucket Consulting Services, Ltd.

November 16, 1990

Dear Counselor:

Thank you for your letters of November 14
and 15, 1990, regarding the
pending patent infringement suit between
Wang and Toshiba/NEC regarding
SIMM technology. I am responding
to both letters at the same time, because
your second letter arrived before I could
continued on p. 72
R&D in Japan


These two reports and two announcements
reflect some of the ongoing research in 
this country. 

Forecast for 2010 

Japan's Economic Planning Agency enlisted the
assistance of a group of 10 experts to assess the
country’s position in technology and its direction
for the next 20 years. After listing 101 technological
items, the study group developed a questionnaire
concerning them, combined the responses,
and wrote a lengthy report and summary (in Japanese).
The summary, released in July 1991,
proved fascinating in the detailed views of the
group members but produced some problems for
readers. For example, the study may not have
much statistical validity, given that one or at most
two experts assessed the individual items. In addition,
parts of the textual material were awkwardly
phrased.

But to me, the most interesting portions of the
report appear in its tables. Table 1 lists selected
items from the report’s tables.

Micromachines 

Under Japan’s National Research and Development
Program (popularly known as the Large-Scale
Project), industry, government, and
academic circles cooperate on research and development
of innovative, advanced, large-scale
industrial technologies deemed important and
urgent for the national economy. Since the program
began in 1955, 29 projects have been
launched, eight of which continue today. MITI
(Japan’s Ministry of International Trade and Industry)
proposed an R&D program called
Micromachine Technology to begin in fiscal year
1991.

The New Energy and Industrial Technology
Development Organization (NEDO) under the
authority of the Agency of Industrial Science and
Technology will conduct the Micromachine Technology
project. NEDO plans to establish the technologies
necessary for the realization of
micromachines.

Elemental technologies. The first step of the
program (covering the first four to five years)
emphasizes the establishment of basic technologies
of elemental components for micromachines
and targets four major R&D items. The first major
item in this phase will establish technologies for
material processing and control of dynamic
mechanisms by developing the elements of a
micromachine. The elements are actuators (thermal-,
electrical-, or magnetic-induced deformation,
electrostatic, hydraulic) and sensors
(electromagnetic, chemical).

A second item calls for the development of
technologies that transform energy from external
sources, micro internal batteries, or micro power
generators. A third will investigate micromachine
communication with the exterior world and conduct
theoretical research and associated software
development of remote control and coordinated
distributed control.

A final major item concerns the measurement 
of accuracy and movement of the micromachines 
for evaluation of the results of the R&D program. 

R&D description. Though micromachines are
small, they are complex systems with advanced
functions. Some of the areas that need to be researched
include the basic theories that underpin
miniaturization methods, structural analysis, materials,
and component technologies of microscopic
processing and assembly. Other areas
include the techniques for producing microscopic
sensors and control circuits, and the system technology
required to perform microscopic motion
and operation. Table 2 lists some of the main
topics envisioned for micromachine technology.

Table 1. Identified technologies and application year.

Technology                            Year

Information electronics
  Biosensor                           2000
  Superparallel computer              2010
  Terabit optoelectronic file         2010
  Superintelligent chip               2010
  Terabit optocommunication device    2010
  Automatic translation systems       2020
  Virtual reality system              2020
  Self-replicating database system    2020
  Neurocomputer                       2030
  Terabit memory                      2030
  Self-replicating chip               2050

New materials
  Ceramics gas turbine                2000
  Magnetic materials                  2010
  Hydrogen occlusion alloy            2010
  Optical IC                          2010
  Superlattice devices                2010
  Nonlinear optoelectronics           2020
  Superconductors                     2030

Automation
  Micromachines                       2010
  Concurrent engineering              2010
  Intelligent CAD                     2020

Communication
  TV conference system                1994
  TV telephone                        1994
  Optoelectronic LAN                  1995
  Broadband ISDN switches             1995
  HDTV                                1995

The technology. Microscopic machines
or instruments with advanced
functions can perform minute tasks or
work in extremely narrow spaces. Their
small size allows them to be applied to
a wide variety of areas, including medicine,
biotechnology, and industry.

Recently, silicon micromachining
technology, in which micrometer mechanical
structures are formed on silicon
wafers, opened up this field. The
technology emerged from etching,
deposition, and other lithographic techniques
for microprocessing silicon developed
in the 1970s and enabled the
production of cantilevers, diaphragms,
and other simple mechanical components.
These products are now used
widely as pressure sensors or are beginning
to be commercialized as acceleration
and flow sensors. Moreover, recent
advances in semiconductor
microprocessing and ultraprecision
processing technology allow the creation
of mechanical parts far smaller
than anything previously developed.

Researchers only recently began to
investigate micromachine technology,
however, and need to overcome many
technological barriers before such machines
can be developed. Some barriers
are those related to friction, durability,
strength, materials, and power sources
and supplies. Researchers also must develop
a number of other technologies,
including ways to design, process, assemble,
and control micromachines.

Micromachines today. We can approach
micromachine development in
two ways. One approach uses technology
in the field of mechanical engineering,
which makes existing mechanisms
even smaller; the other uses Micro
Electro Mechanical Systems (MEMS)
technology, which uses IC production
technology. Many researchers have
proposed or built prototypes of various
microactuators and microstructures that
could serve as the component technologies
for micromachines.

In the United States, schools like
MIT, Stanford, and the universities of
Wisconsin-Madison, Michigan, and
California, Berkeley continue to
research surface micromachining
technology and LIGA (Lithografie,
Galvanoformung, Abformung) process
technology in their silicon and LIGA
process research centers.

MIT researchers produced a 100-µm-diameter
polysilicon micromotor rotating
at 15,000 rpm and analyzed its
movement. Researchers created the
motor using the same process used for
IC production. Wisconsin-Madison researchers
produced 319 structures containing
gears with 55-µm inside
diameters. Researchers at Berkeley produced
a 120-µm-diameter electrostatic
motor and verified that it does rotate.
They can also measure friction coefficients,
one of the most difficult problems
in micromachine technology,
using a micro electrostatic linear
actuator.

The US National Science Foundation 
supports these efforts. In 1988 NSF distributed
its micromachine research
budget among eight universities and 
in 1989 provided funding to 11 
universities. 

Europe also supports several micromachining
facilities, including Germany’s
Fraunhofer Institut, Technische
Universität Berlin, Kernforschungszentrum
Karlsruhe Institut für Mikrostrukturtechnik,
and the Netherlands'
University of Twente. Research concentrates
mostly on sensors. The
Fraunhofer Institut für Mikrostrukturtechnik
produced prototypes of a
vibration sensor with 32 cantilever mechanical
resonators and a 1.5-mm x
1.25-mm cantilever thermal bimorphic
microactuator. These facilities receive
subsidies both from the European Economic
Community and from their respective
countries.

Japan pursues numerous creative
studies related to micromachine technology.
For example, the University of
Tokyo developed prototypes of a micro
Stirling engine with high thermal efficiency
as a microactuator. Tohoku University
produced a microvalve using
silicon. NTT Applied Electronics Laboratories
produced a 500-nm x 500-nm,
active integrated optical microencoder
with 0.01-µm resolution.

Research at many universities, national
research institutes, and private
companies continues to produce prototypes.
These include electrostatic linear
actuators, micropressure sensors, micro
ISFET (ion-sensitive field-effect transistor)
sensors, micromanipulators using
piezoelectric-impact drive systems, micro
active catheters using shape
memory alloy (SMA), and more.

Table 2. Main topics of micromachine research.

Technologies                      Description

Microscopic mechanical device     R&D on structures, materials, machining
                                  techniques, integration techniques,
                                  and power supplies for microscopic
                                  mechanisms and functional
                                  components required for
                                  micromachines

                                  Development of technology to enable
                                  the production of various mechanical
                                  devices

Microscopic sensors, control      R&D in the technologies needed to
circuits, and other techniques    produce extremely miniaturized
for miniaturized electronic       electronic devices such as microscopic
devices                           sensors and control circuits used in
                                  micromachines

Control and operation             R&D in motion control and operation
                                  technologies for microscopic
                                  mechanisms

Measurement and evaluation        Basic research into measurement
                                  methods, evaluation methods, and
                                  microscopic measurement technology
                                  as they relate to various component
                                  devices

Support                           Basic research into support technologies
                                  including lubrication techniques for
                                  microscopic parts, theoretical
                                  simulation, and CAD/CAM

System integration                R&D project extending 10 years
                                  (FY1991-FY2000) and costing
                                  approximately 25 billion yen


These technologies, which mainly
use semiconductor production techniques,
represent only a small part of
micromachine production technology.
Research in this field is still in its infancy,
and many problems remain to be
solved. Some of the hurdles to be overcome
include the development of production
and process technologies
geared specifically toward micromachines
and solving questions related
to friction, durability, strength, materials,
power supplies, and control.

Application of results. Since
micromachine technology will have a
variety of applications, the program will
focus on common component technologies.
Once developed, these areas
will probably result in applications
like the following.

Industrial micromachines. Industry
faces the need to boost reliability and
reduce maintenance costs for ever
more-advanced and complex mechanical
systems and equipment (power
plants and airplane engines are two
good examples). A tremendous need
exists for technology that makes it possible
to perform inspections and repairs
in extremely tight spaces without having
to dismantle the entire system or
equipment in question, such as plant
pipe systems and airplane engines.

Industrial micromachines will enable
inspection and repair without requiring
that plant equipment be dismantled.
This capability will make it possible to
perform early inspection and repair to
minimize the extent of damage. We can
therefore expect significant improvements
in capacity utilization and maintenance
costs for electric power plants
and other facilities.

Medical micromachines. Today's
medical procedures do not sufficiently
alleviate the pain experienced during
diagnosis and treatment. In addition, as
the population ages, we will see a
strong demand for advanced medical
equipment that lessens the physical and
mental stress inflicted on patients.

continued on p. 87 
The Scalable Coherent Interface and
Related Standards Projects 


The Scalable Coherent Interface (IEEE P1596) provides bus services by transmitting packets on
a collection of point-to-point unidirectional links. Its protocols support cache coherence in a 
distributed shared-memory multiprocessor model, with message passing, I/O, and LAN com¬ 
munication taking place over fiber optic or wire links. Several ongoing SCI-related projects 
apply the SCI technology to new areas or extend it to more difficult problems. 


David B. Gustavson 

Stanford Linear Accelerator 
Center 



The Scalable Coherent Interface (SCI)
was developed by a number of bus 
designers and system architects who 
had come to understand the funda¬ 
mental limits to bus technology during their work 
on Fastbus (IEEE 960) and Futurebus+ (IEEE 
896.x). These modern buses push bus signaling
technology to its limits and provide various ar¬ 
chitectural features that support the use of mul¬ 
tiple processors. 

These bus limits are rapidly becoming a seri¬ 
ous problem as the demand for computing power 
continues to grow. The economic reality is that 
we can only meet this demand by using a large 
number of fast microprocessors. Buses, however, 
are inherently a bottleneck (only one transfer at 
a time), and their signaling speed is limited by 
the imperfect transmission lines that result from 
bus-style connections. 

Therefore, buses can’t support a large number 
of processors, especially not fast ones. While we 
can extend their useful life a bit by cleverness 
and brute force, the potential gains are relatively 
small and the costs become very high. For ex¬ 
ample, doubling the width of a bus does not 
double its speed because there are fixed over¬ 
heads associated with arbitration and addressing. 
Lengthening block transfers to reduce the effect 
of these overheads is of little use once the blocks 
exceed the size of cache lines. 
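
A rough model makes the width argument concrete. The sketch below is only an illustration: it assumes a fixed number of overhead cycles per transfer for arbitration and addressing, and the cycle counts and block size are made-up values, not figures from any bus specification. The function name `bus_throughput` is likewise hypothetical.

```python
def bus_throughput(block_bytes, width_bytes, overhead_cycles):
    """Bytes moved per bus cycle for one block transfer.

    overhead_cycles is an assumed fixed cost for arbitration and
    addressing; the data phase takes block_bytes / width_bytes cycles,
    rounded up.
    """
    data_cycles = -(-block_bytes // width_bytes)  # ceiling division
    return block_bytes / (overhead_cycles + data_cycles)


# Illustrative numbers only: one 64-byte cache line, 4 overhead cycles.
narrow = bus_throughput(64, width_bytes=4, overhead_cycles=4)  # 64 / (4 + 16)
wide = bus_throughput(64, width_bytes=8, overhead_cycles=4)    # 64 / (4 + 8)
speedup = wide / narrow                                        # less than 2
```

With these assumed numbers the wider bus moves the line in 12 cycles instead of 20, so doubling the width buys well under a factor of two, and the fixed overhead takes a larger share of every transfer.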


We can increase signaling speeds by shortening
the bus, but that makes it less useful. Reducing
the signal voltage helps, but eventually this
solution experiences noise problems. Using mul¬ 
tiple buses to achieve more than one transfer at a 
time results in a complex (expensive) bus-bridge 
mechanism to maintain cache consistency (co¬ 
herence) in shared-memory systems that use bus- 
snooping technology. 

Paul Sweazey (Futurebus cache-coherence task 
group coordinator) started the Superbus Study 
Group in November of 1987 to see if there were 
potential solutions to these problems. In July 1988 
the outline of the solutions had become clear, 
and the P1596 SCI working group replaced the
study group. The work was essentially completed
by January 1991, when specification draft D1.00
went out for ballot. Since then, it has been un¬ 
dergoing minor improvements, polishing, and 
debugging of the specification C code. (Most of 
the SCI specification is executable, to reduce 
ambiguity, simplify testing, and enable accurate 
simulation.) 

The resulting draft D2.00* recirculated to the
balloting body in December. Since the draft passed
with 92 percent affirmative, and all but one objection
has been resolved (we refused to change
the C code to Pascal), final approval by the IEEE
Standards Board seems probable in March 1992,
unless new objections arise.
SCI goals

The SCI design goals include

• Scalability, so that the same mechanisms can be used in 
high-volume, single-processor (or few-processor) sys¬ 
tems such as one might find in desktop machines, as 
well as in large, highly parallel multiprocessors (next- 
generation supercomputers); 

• Coherence, to support the efficient use of cache memories
in the most general and easiest-to-use multiprocessor
model, distributed shared memory; and

• An interface, a standardized open communication architecture
that allows products from multiple vendors to
be incorporated into one system and interoperate
smoothly.

Scalability keeps costs down, not only through increased 
volume of production but through the simplicity of having to 
learn only one new paradigm—one that will work over sev¬ 
eral generations of machines. (See box.) 


SCI applications 


SCI uses point-to-point links to achieve very high speed 
communication. For the highest performance over short 
distances (typically within a cabinet), 16-bit-wide links run
at 1,000 Mbytes/s. For I/O applications within a room, serial 
coaxial cable links run at 1,000 Mbps. For I/O over cam¬ 
pus distances of a few kilometers, optical fibers can carry 
the same serial bit stream. See Figure A. 

SCI's scalable architecture allows the same protocols
to cover the range from internal communication within a
multiprocessing supercomputer to local area network
applications. LAN communications look like moving data
from one address to another in memory, a very simple
software model. However, wide-area networks need
hierarchical addressing instead of SCI's flat 64-bit address
model, so the usual software protocol translations are
necessary.


A standard SCI module defines signals, connector, and 
power for operation at a link speed of 1,000 Mbytes/s. A 
standard cable and connector defines the use of these signals 
for applications in which the module form factor is not ap¬ 
propriate, over short distances (meters). A standard fiber¬ 
optic serial interface solves interface problems over longer 
distances (kilometers) at a link speed of 1,000 Mbits per sec¬ 
ond. The same bit stream may be sent over coaxial cable at 
lower cost for medium distances (tens of meters). We plan to 
standardize other speeds and signaling in the future as 
appropriate. 

Solving the bus signaling and bottleneck problems. 

SCI needed two fundamental changes from the way a bus 
transmits information. First, to make signaling speed inde¬ 
pendent of the size of the system, interfaces don’t wait for 
each signal to propagate (that is, a bus cycle) before sending 
the next signal. Each communication takes place by sending 
packets that include an address, command, and data as 
needed. While the propagation velocity of a packet is still 
limited by the speed of light, its rate of information transfer is 
not. As technology advances, the transfer rate can increase 
indefinitely. Fortunately, computer scientists know techniques 
that can compensate for the bad effects of delay (latency), 
but little can be done to compensate for a transfer rate (band¬ 
width) that is too small. (See Transaction phases box.) 
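
The distinction between delay and transfer rate can be put in numbers. The following sketch is a simplified model under stated assumptions: the propagation velocity constant is an assumed round figure, the function names are hypothetical, and the 2-ns symbol time is derived from the article's 16-bit-wide link at 1,000 Mbytes/s (2 bytes per symbol, 500 million symbols per second). It compares a handshake that waits for a round trip on every transfer with back-to-back packet symbols.

```python
PROPAGATION_M_PER_NS = 0.2  # assumed signal velocity, about 2/3 the speed of light


def handshake_rate(bytes_per_cycle, length_m):
    """Bytes per ns when every transfer waits for a round trip on the wire."""
    round_trip_ns = 2 * length_m / PROPAGATION_M_PER_NS
    return bytes_per_cycle / round_trip_ns


def pipelined_rate(bytes_per_symbol, symbol_ns):
    """Bytes per ns when symbols are launched back to back without waiting."""
    return bytes_per_symbol / symbol_ns


def pipelined_latency_ns(packet_bytes, bytes_per_symbol, symbol_ns, length_m):
    """Time until the last symbol of a packet reaches the far end."""
    symbols = -(-packet_bytes // bytes_per_symbol)  # ceiling division
    return symbols * symbol_ns + length_m / PROPAGATION_M_PER_NS


short_bus = handshake_rate(2, length_m=1.0)    # round trip limits the rate
long_bus = handshake_rate(2, length_m=10.0)    # ten times longer, ten times slower
link = pipelined_rate(2, symbol_ns=2.0)        # 1 byte/ns regardless of length
latency = pipelined_latency_ns(80, 2, 2.0, length_m=10.0)
```

In this model the cable length appears only in the latency of the pipelined link, never in its rate, which is the property the paragraph above describes.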


Secondly, SCI uses multiple signal paths (links) so that 
multiple independent transfers can take place concurrently. 
For high performance, designers can use separate links for
each processor, memory, or I/O device. 

We further refined the SCI link design by applying lessons 
learned in practical multiprocessor systems. The link signals 
are differential because differential signals produce the least 
system ground noise and are least sensitive to noise from 
other sources. The links are unidirectional because bidirec¬ 
tional links create noise when the drivers are turned off or on 
to reverse the direction of the link, and because turn-around 
delays increase with cable length, a scalability problem. The 
links are fast and narrow because we expect pins will always 
be relatively expensive. 

SCI does not use reverse-direction flow control signals 
because such mechanisms make the amount of buffer stor¬ 
age needed in the interfaces dependent on cable length. To 
reach high speeds, we use low-voltage differential signals. 
SCI initially uses 16-bit-wide, ECL-compatible signals because
of the industry experience with and support for that stan¬ 
dard. Future links will probably use even lower voltages that 
are chosen for compatibility with VLSI CMOS or GaAs circuitry. 

Thus each SCI interface (node) has (at least) two links, one 
incoming and one outgoing. The links run continuously,
sending idle symbols when no packets are being transmitted, so
that the receiver can remain perfectly synchro¬ 
nized at all times, ready for action. SCI pack¬ 
ets do not need the prologue that is essential 
for Ethernet or similar networks. 

In a high-performance SCI system, a vendor-dependent
switch accepts packets from
nodes and routes them to the appropriate other
nodes as specified by the address in the packet.
SCI does not specify the details of such a
switch, because many cost/performance trade-offs
exist.

Lowest cost SCI systems, such as desktop 
systems, typically will use a ring connection 
instead of a switch, connecting one node’s 
output link to its neighbor’s input. The decision
to support rings made the interface circuit
more complex because a node may
receive a packet intended for some other node 
and have to pass it along. That process re¬ 
quires some buffer memory and some address 
recognition logic. However, the result is that 
one interface definition works over a very wide 
range of applications, increasing the produc¬ 
tion volume and lowering the costs for every¬ 
one. (See Node interface structure.) 
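
The pass-along behavior of a ring node can be sketched in a few lines. This is an illustrative model only, not the SCI interface logic: the `RingNode` class, the dictionary packet layout, and the hop counter are all hypothetical, and buffering is ignored.

```python
class RingNode:
    """One node on a unidirectional ring (toy model, not the SCI spec)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.received = []      # packets addressed to this node
        self.next_node = None   # the neighbor fed by our output link

    def accept(self, packet, hops=0):
        """Address recognition: keep our own packets, pass the rest along.

        A packet for a node that is not on the ring would circulate
        forever in this toy model; no error handling is attempted.
        """
        if packet["target"] == self.node_id:
            self.received.append(packet)
            return hops
        return self.next_node.accept(packet, hops + 1)


def build_ring(node_ids):
    """Connect each node's output link to its neighbor's input."""
    nodes = [RingNode(i) for i in node_ids]
    for node, neighbor in zip(nodes, nodes[1:] + nodes[:1]):
        node.next_node = neighbor
    return nodes


ring = build_ring([0, 1, 2, 3])
# Node 1 sends to node 0: the packet must pass through nodes 2 and 3.
hops = ring[1].next_node.accept({"target": 0, "data": "hello"})
```

Because the links are unidirectional, a packet from node 1 to node 0 travels the long way around, which is why every node needs the address recognition step shown in `accept`.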

Point-to-point signaling is much easier than 
bus signaling. This, in combination with SCI's
low-current, single-voltage (48V) power dis- 


Transaction phases 

As seen in Figure B, SCI transactions handshake on a packet basis 
rather than on a bus cycle basis. A request packet contains all the 
necessary address, command, and possibly data needed to initiate a 
transaction. The packet is either accepted and stored in the responder's
queues or (if there is insufficient space) is discarded. An echo tells 
the requester whether it can discard its send packet or must retransmit 
later because it was not accepted. A similar handshake occurs on 
the response. 
Figure B. Transactions have request and response subactions.
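
The accept-or-discard handshake described in this box can be sketched as follows. This is a minimal model under assumptions: the echo values "accepted" and "retry", the `Responder` class, and the `drain_between` switch (a stand-in for the responder working off its queue) are hypothetical names, not terms from the SCI specification.

```python
from collections import deque


class Responder:
    """Receiver with a bounded queue; the return value plays the echo."""

    def __init__(self, queue_slots):
        self.queue = deque()
        self.queue_slots = queue_slots

    def receive(self, packet):
        if len(self.queue) < self.queue_slots:
            self.queue.append(packet)
            return "accepted"   # sender may discard its copy
        return "retry"          # packet discarded; sender must keep its copy


def send(responder, packet, max_attempts, drain_between=False):
    """Retransmit until the echo reports acceptance; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if responder.receive(packet) == "accepted":
            return attempt
        if drain_between and responder.queue:
            responder.queue.popleft()  # responder makes progress meanwhile
    return None


responder = Responder(queue_slots=1)
first = send(responder, "a", max_attempts=3)                       # room available
blocked = send(responder, "b", max_attempts=3)                     # queue stays full
retried = send(responder, "b", max_attempts=3, drain_between=True)
```

The sender keeps its packet until an echo confirms acceptance, so a full queue costs a retransmission rather than a stalled link.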
Node interface structure


Data arriving on the incoming link shown in 
Figure C have to be resynchronized to the node’s 
own clock. The receiver’s elastic buffer circuitry
handles this step. The rest of the node circuitry
is synchronous, which greatly simplifies the de¬ 
sign of the ultrafast FIFOs and logic. 

If a node receives a packet intended for some 
other node while it is transmitting a packet of its 
own, part or all of the incoming packet moves
to the bypass FIFO. 

When no packet is being received, idle sym¬ 
bols keep the receiver synchronized and carry 
information about the priorities of other nodes. 
They also carry Go bits, which act like tokens 
and help ensure fair use of the links. Fair use of 
part of the bandwidth is important in avoiding 
starvation or deadlock. 

An SCI node maintains dual queues to keep 
responses independent of requests. Without dual 
queues, excessive requests could prevent the 
sending of responses, resulting in deadlock. 



Figure C. Block diagram of a typical SCI node. 


tribution system, should make ring-connected backplanes very 
inexpensive. A few printed-circuit-board layers will generally 
be enough. 

Intermediate-level SCI systems may use a combination of
switch and ring, using paired SCI interfaces as a simple bridge
that connects two rings. By combining many rings, designers
can build a distributed switch fabric. Active research is in
progress to discover optimal configurations.^*^ Scott and
Goodman^ show that when pipelining is permitted, as is the
case for SCI, we gain performance rapidly relative to synchronous
systems by increasing the dimensionality of the
interconnect. The resulting performance is much higher than
for packet-synchronous systems.

SCI defines efficient packet-based protocols that provide 
the kinds of services we expect from a computer bus. The 
main differences seen by the user relate to the separation of 
the request for service from the response. A simple bus waits 
during a memory read access time, until it gets the data. No 
other parties can use the bus during that wait. This approach 
makes it conceptually easy to handle error conditions or to 
perform complex mutual-exclusion operations (like read- 
modify-write). See the Transaction formats box.

More sophisticated buses split the operation into a request 
and a response phase, just as SCI does. Then the interface 
must keep track of pending requests to match the responses 
to them, and a different style of mutual exclusion becomes 
necessary. Engineers who have already used split-response 


Transaction formats 

SCI packets have a 16-byte header that contains address,
command, transaction identifier, and (in a response)
status information. All packets that may need
sponse) status information. All packets that may need 
storage space in queues are multiples of 16 bytes, to 
simplify storage management at very high speeds. The 
echo packet is an 8-byte subset of the header (not shown 
in Figure D); it is never stored in queues. A few transac¬ 
tions need an extended header (not shown either), which 
adds another 16 bytes. 


Transaction   Request                       Response

readxx*       Header                        Header + 0, 16, 64, or 256
writexx*      Header + 16, 64, or 256       Header
movexx*       Header + 0, 16, 64, or 256    (no response)
locksb        Header + 16                   Header + 16

* xx represents one of the allowed data block lengths
(number of data bytes, listed after the header).


Figure D. SCI packets.
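
The sizes in this box can be checked mechanically. The sketch below encodes the data lengths as read from Figure D (the table layout there is partly reconstructed, so treat the per-transaction entries as an assumption) together with the 16-byte header and optional 16-byte extended header described in the text. The function and table names are hypothetical.

```python
HEADER_BYTES = 16
EXTENDED_HEADER_BYTES = 16

# Data block lengths allowed after the header, as read from Figure D.
# None means the transaction has no packet of that kind.
ALLOWED_DATA = {
    "readxx":  {"request": (0,),             "response": (0, 16, 64, 256)},
    "writexx": {"request": (16, 64, 256),    "response": (0,)},
    "movexx":  {"request": (0, 16, 64, 256), "response": None},
    "locksb":  {"request": (16,),            "response": (16,)},
}


def packet_bytes(transaction, kind, data_bytes=0, extended_header=False):
    """Total size of a request or response packet, in bytes."""
    allowed = ALLOWED_DATA[transaction][kind]
    if allowed is None or data_bytes not in allowed:
        raise ValueError(f"{transaction} {kind} cannot carry {data_bytes} bytes")
    size = HEADER_BYTES + data_bytes
    if extended_header:
        size += EXTENDED_HEADER_BYTES
    # Every queued packet is a multiple of 16 bytes, as the text requires.
    assert size % 16 == 0
    return size


read_request = packet_bytes("readxx", "request")             # header only
write_request = packet_bytes("writexx", "request", 64)       # header + 64
lock_response = packet_bytes("locksb", "response", 16)       # header + 16
```

Every size the function can return is a multiple of 16, which matches the storage-management rule stated above; the 8-byte echo is never queued and is left out of the model.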


buses will find SCI to be clean and simple; those who haven’t 
may face a learning curve. 

Solving the cache coherence problem. Cache memories
are important for keeping processors running at full speed,
but they introduce some system management complexity. 
The various caches often contain duplicate copies of data 
that must be kept consistent with one another, the cache 
coherence problem. Say that two processors read the same 
data (perhaps a synchronization flag) from memory and cache 
it (perhaps on chip), and then one changes the data (per¬ 
haps to free a resource that the other is waiting for). Some¬ 
how the other processor must discover that its cached copy 
is invalid so that it can be updated with current information. 

Designers have maintained cache consistency, or coher¬ 
ence, in small bused systems by taking advantage of the bus 
bottleneck: every cache controller observes every transaction
in the system, “snooping” to catch transactions that might
invalidate cached data. That approach can be scaled up to a 
few buses by making the bridges rather sophisticated, but it 
does not work in highly parallel systems. 

A cache coherence scheme that scales to large systems 
requires a directory that keeps track of which data are being 


used by which caches, so that the appropriate caches can be 
updated as necessary. Earlier schemes used a directory but 
kept it in memory. Instead, SCI maintains it as a distributed, 
doubly linked list of caches, with the head pointer at a memory 
controller and the link pointers stored in the cache control¬ 
lers. With this approach we know that the correct amount of 
storage is always available for the directory structure, no matter
how many caches are sharing copies of a particular line of 
data. This approach also spreads the maintenance traffic across 
the system, rather than concentrating it at the memory. Even 
though these linked-list structures are shared, we designed 
them to be updated concurrently by multiple processors with¬ 
out any semaphores or lock variables. The protocols use in¬ 
divisible compare-and-swap transactions instead. (See the 
Distributed cache tags box.) 

We can avoid the cache coherence problem if memory is 
not shared. Most of the early multiprocessors, which rely on 
message passing and explicit interprocessor communication, 
use this approach. However, it is often difficult to move com¬ 
puter programs from single-processor machines to message¬ 
passing multiprocessors. Note that shared-memory machines 
can easily pass messages, so they are more versatile. The 


Distributed cache tags 

SCI is a distributed system with many links carrying 
data independently and concurrently. Only directory- 
based cache coherence is practical in large high-perfor¬ 
mance systems of this kind. The directory keeps track of 
which caches are sharing each cache line of data, so that 
the right caches can be notified when their copy be¬ 
comes invalid. Rather than centralizing the coherence 
directory (in RAM), SCI distributes it among those cache 
controllers that are sharing the data. 


Processors 

Head Mid Mid Tail 



E unit 
Cache 


Figure E. A typical linked list for one cache line. 


The SCI coherence directory always has storage avail¬ 
able in exactly the amount needed, because the cache controllers
sharing the given cache line provide it. The arrows
in Figure E represent bidirectional pointers forming a dou¬ 
bly linked list. These pointers and a few status bits consti¬ 
tute the directory entry for this cache line. 

The doubly linked list structure makes it possible for 
any cache to roll out a cache line to make room for new 
data, removing itself from the list by telling its neighbors to 
point to each other. This list maintenance traffic is distrib¬ 
uted throughout the system, greatly reducing the concen¬ 
tration of traffic that is typical of centralized directory 
schemes. 
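
The roll-out step can be illustrated with an ordinary doubly linked list. This is a sequential sketch only: in SCI the pointer updates are carried by transactions between nodes, which this model does not attempt to reproduce, and the names `Sharer`, `link`, and `roll_out` are hypothetical.

```python
class Sharer:
    """One cache's directory entry for a single cache line."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.toward_head = None
        self.toward_tail = None


def link(sharers):
    """Chain the given sharers head to tail; return the head."""
    for a, b in zip(sharers, sharers[1:]):
        a.toward_tail, b.toward_head = b, a
    return sharers[0]


def roll_out(head, sharer):
    """Unlink a sharer by making its neighbors point to each other.

    Returns the head of the list afterwards (None if the list is empty).
    """
    before, after = sharer.toward_head, sharer.toward_tail
    if before is not None:
        before.toward_tail = after
    if after is not None:
        after.toward_head = before
    sharer.toward_head = sharer.toward_tail = None
    return after if sharer is head else head


def node_ids(head):
    """Walk the list from head to tail, collecting node identifiers."""
    ids = []
    while head is not None:
        ids.append(head.node_id)
        head = head.toward_tail
    return ids


a, b, c, d = (Sharer(i) for i in (1, 2, 3, 4))
head = link([a, b, c, d])
head = roll_out(head, b)        # a mid entry leaves: 1, 3, 4 remain
after_mid = node_ids(head)
head = roll_out(head, a)        # the head leaves: 3 becomes the new head
after_head = node_ids(head)
```

Only the departing cache and its two neighbors are touched, which is why this maintenance traffic stays spread across the system instead of converging on memory.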

The protocols support optional optimizations for impor¬ 
tant cases like pairwise sharing and serial resource alloca¬ 
tion. Pairwise sharing lets two sharers pass data back and 
forth as needed without involving memory. Serial resource 
allocation (queue on lock bit) uses the list structure to pass 
the resource along without needless communication in the 
interconnect. 

Note that the directory entry is a shared data structure 
that may be concurrently accessed by multiple processors. 
The SCI protocols rely on atomic compare-and-swap op¬ 
erations to ensure correctness without using semaphores 
or lock variables, which would be less efficient and intro¬ 
duce management complexity. For example, what do you 
do when a processor sets a lock variable and then dies? 
Cache coherence


Until recently, microprocessor designers considered the
use of a cache to be their own decision, based on cost/
performance objectives, with little concern for the needs
of multiprocessors. With just one processor, the existence
of a cache usually affects its performance but not the correctness
of its execution. A common exception occurs in
the case of self-modifying code, where the processor executes
instructions from the cache that are writing to
memory in a futile effort to modify themselves.

However, in a system with two or more processors each 
of which has a cache, serious problems can result if the 
caches are not properly designed to work in harmony. For
example, suppose two processors use a variable to keep 
track of which one is using the printer, and the software 
waits until no one is using it before starting a new print 
job. If two primitive processors read the same variable and 
each keeps it in its own cache, they each read only their 
cached copy, not any changes that the other processor 
may make to the variable. Designers must either arrange 
that such variables are never stored in caches or design the 
cache control so that each cache discovers when its copy 
of the variable is no longer correct, discards the old (in¬ 
valid) value, and obtains a fresh copy. 

Such mechanisms add cost and complexity, so it is not
surprising that they were omitted in early caching 
(uni)processors. We call keeping all the cached copies of a 
data item consistent the cache coherence problem. 
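
The printer example can be reproduced with a deliberately naive cache model. This is a sketch for illustration: `NaiveCache`, the write-through policy, and the `printer_owner` variable are assumptions chosen to show the stale-copy effect, not a description of any real processor.

```python
class NaiveCache:
    """A cache with no coherence mechanism (the 'primitive processor' case)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]           # hit: memory is never consulted

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value         # written through, but nobody is told

    def invalidate(self, addr):
        self.lines.pop(addr, None)        # discard the old copy


memory = {"printer_owner": "A"}
cache_a, cache_b = NaiveCache(memory), NaiveCache(memory)

cache_a.read("printer_owner")
cache_b.read("printer_owner")
cache_a.write("printer_owner", None)      # processor A releases the printer

stale = cache_b.read("printer_owner")     # B still sees its old cached copy
cache_b.invalidate("printer_owner")       # what a coherence mechanism must trigger
fresh = cache_b.read("printer_owner")     # B now fetches the current value
```

Processor B waits forever on the stale value unless something causes the invalidation; deciding who issues that notification, and to whom, is the coherence problem.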

The cache controller can determine that its copy is invalid
either by tracking every transaction in the system that
might be capable of changing the data or by being notified 


by a reliable source. Bused multiprocessors usually use
snooping to maintain coherence. Each cache monitors the
bus at all times, which keeps it fairly busy since it also has
to be handling the memory requests from its processor.
When the cache controller identifies another processor writing
to the address of data it has cached, it can either pick
up the new value on the fly (“snarfing”) or mark the old
one invalid and go to memory for a good copy when (and
if) the processor needs it later. Say a cache can verify that
it has the only good copy (because it changed the data)
and it sees another processor trying to read the data from
memory. It must then intervene to prevent the other cache
from accepting a stale copy. Either the cache can supply
the data itself or it can abort the transfer, update memory
with the latest copy, and then let the transfer be retried.

The snooping mechanism relies on every controller being
able to track every transaction. That’s acceptable on a
bus, because only one transaction can happen at a time,
and each controller tracks it. With effort and care, snooping
can even work in a system with several buses. But in a
highly parallel system, like SCI, the cost of snooping is
intolerable; too many transfers happen at the same time
and in different places.

The usual solution to this problem is to maintain a direc¬ 
tory that tracks which caches have copies of each cache 
line. (A cache line is the amount of data tracked as a single 
entity by the cache controller. SCI tracks 64-byte lines; earlier
systems usually used smaller line sizes.)

The next decision is where to keep the directory and 

continued on p. 16 


goal, then, is to make the coherence protocols efficient and 
economical. 

See the Cache coherence box for further amplification of 
the problem. 

Multiprocessor issues. SCI designers placed a high priority
on eliminating several architectural problems, such as
deadlocks and livelocks, that have plagued past multiprocessors.
Deadlocks can occur when two processors request the
same two resources in the opposite order. Each processor
may access one resource (blocking the other processor) then
become blocked by failure to access the other resource.
Livelock, or starvation, occurs when one processor repeatedly
accesses a shared resource without letting another have
its turn for an arbitrarily long time. Deadlock and livelock are
examples of what we generically refer to as “forward progress”
issues. To guarantee forward progress, each operation should
result in some useful work done, so that the system will not
consume all its resources performing useless retries.
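
The opposite-order deadlock is easy to reproduce in a toy scheduler. The sketch below steps two processors in lockstep, each taking resources in its own order and releasing everything once it holds them all. It models the failure described above, not the SCI protocol; the function name and the resource labels are hypothetical.

```python
def acquire_all(order_a, order_b):
    """Return 'done' if both processors finish, 'deadlock' if both stall."""
    orders = {"A": order_a, "B": order_b}
    progress = {"A": 0, "B": 0}   # how many resources each has acquired
    owner = {}                    # resource -> processor holding it

    while True:
        moved = False
        for proc in ("A", "B"):
            step = progress[proc]
            if step == len(orders[proc]):
                continue                       # this processor has finished
            wanted = orders[proc][step]
            if wanted in owner:
                continue                       # blocked: someone holds it
            owner[wanted] = proc
            progress[proc] += 1
            moved = True
            if progress[proc] == len(orders[proc]):
                # Work complete: release everything this processor holds.
                for resource in [r for r, p in owner.items() if p == proc]:
                    del owner[resource]
        if all(progress[p] == len(orders[p]) for p in orders):
            return "done"
        if not moved:
            return "deadlock"                  # nobody can make progress


opposite = acquire_all(["x", "y"], ["y", "x"])   # each grabs one, waits on the other
same = acquire_all(["x", "y"], ["x", "y"])       # B simply waits its turn
```

In the opposite-order run each processor ends up holding one resource and waiting for the other, so no step can ever succeed; with a common order the second processor merely waits.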

Experience with multiprocessor systems shows that they
tend to synchronize themselves around problems like these.
Even though the designer has estimated the probability of
livelock to be extremely small, its effect on the system once it
occurs is such that it tends to cause frequent repetitions.

Therefore, we designed the SCI protocol to eliminate de¬ 
pendencies that can cause deadlocks and to include a simple 
mechanism that assures fair allocation of resources to avoid 
livelocks. In a few cases, deadlock avoidance seems to make
the protocol less efficient (until we consider the indirect
consequences!). But in most cases choosing a clean protocol
resulted in no penalty. 

Of course, software that is outside SCI’s scope can still
create deadlocks. The SCI designers could only eliminate
deadlocks from the underlying protocols.

SCI includes efficient mechanisms for supporting shared
resources, shared data structures, and mutual exclusion. Since
the old standby read-modify-write is impractical in a
parallel-processing environment, SCI designers took a fresh look at
the problem and incorporated several mechanisms that will
be helpful.

In a cache-coherent environment, the mechanism that guar¬ 
antees exclusive use of a cache line by a writer can handle 
mutual exclusion. The processor can temporarily lock an ex¬ 
clusive cache line while it performs any operations it desires 
then release it to other readers or writers. However, coher¬ 
ence may not be available in all cases. For example, when 
accessing (through a bridge) a bus that does not support 
cache coherence, one may need to tell the bridge to perform 
a read-modify-write. 

Therefore, SCI defines a set of lock primitives that experi¬ 
ence shows to be useful for a variety of purposes: masked 


swap, compare-and-swap, and fetch-and-add. Each of these 
primitives sends the command and necessary data to a desti¬ 
nation device (perhaps a bridge or a memory controller). 
That device performs the operation indivisibly, returning the 
result to the requester. This mechanism works well through 
switches or other interconnects. (It even works well on buses.) 
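
The three primitives can be modeled as methods on the destination device, each returning the old value to the requester. This is a sketch under assumptions: the exact operand layout and word size of the SCI lock transactions are not given here, so the signatures below (and the choice that a masked swap replaces only the bits selected by the mask) are illustrative rather than the standard's definitions.

```python
class LockingResponder:
    """Destination device that applies each operation indivisibly (toy model).

    Each method runs to completion before the next request is handled,
    which stands in for the indivisibility the device guarantees.
    """

    def __init__(self):
        self.words = {}   # address -> integer value, default 0

    def masked_swap(self, addr, data, mask):
        """Replace only the bits selected by mask; return the old word."""
        old = self.words.get(addr, 0)
        self.words[addr] = (old & ~mask) | (data & mask)
        return old

    def compare_and_swap(self, addr, expected, new):
        """Store new only if the word equals expected; return the old word."""
        old = self.words.get(addr, 0)
        if old == expected:
            self.words[addr] = new
        return old

    def fetch_and_add(self, addr, increment):
        """Add increment to the word; return the value before the add."""
        old = self.words.get(addr, 0)
        self.words[addr] = old + increment
        return old


device = LockingResponder()
device.words[0] = 0xAAAA
old_word = device.masked_swap(0, data=0x0055, mask=0x00FF)   # low byte replaced
ticket = device.fetch_and_add(1, 5)                          # counter starts at 0
won = device.compare_and_swap(1, expected=5, new=9)          # succeeds
lost = device.compare_and_swap(1, expected=5, new=7)         # fails, word is 9
```

Because the requester learns the old value in the response, it can tell whether a compare-and-swap took effect without any further traffic.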

A particularly interesting use of the swap operations is in
the maintenance of shared lists that hold information for an
interrupt-servicing processor or commands for a DMA controller.
If these lists are properly structured, multiple processors
can add items to them while one processor takes items
off, without any need for lock variables.
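
One well-known way to structure such a list is sketched below: producers push onto a shared head pointer with compare-and-swap, and the single consumer detaches the whole list with one swap. The article does not say how SCI software lays out its lists, so this arrangement and the names `SharedHead`, `WorkItem`, `add_item`, and `take_all` are assumptions. The model is sequential; indivisibility is represented by each method running to completion.

```python
class SharedHead:
    """A single pointer word updated only through indivisible operations."""

    def __init__(self):
        self.value = None

    def swap(self, new):
        old, self.value = self.value, new
        return old

    def compare_and_swap(self, expected, new):
        old = self.value
        if old is expected:
            self.value = new
        return old


class WorkItem:
    def __init__(self, payload):
        self.payload = payload
        self.next = None


def add_item(head, item):
    """Producer side: retry until the head did not change underneath us."""
    while True:
        item.next = head.value
        if head.compare_and_swap(item.next, item) is item.next:
            return


def take_all(head):
    """Consumer side: detach the whole list in one swap, oldest item first."""
    node = head.swap(None)
    payloads = []
    while node is not None:
        payloads.append(node.payload)
        node = node.next
    payloads.reverse()            # items were pushed newest-first
    return payloads


head = SharedHead()
add_item(head, WorkItem("disk done"))
add_item(head, WorkItem("packet arrived"))
batch = take_all(head)
```

No lock variable is held at any point, so a producer that stops partway through cannot leave the list permanently unavailable to the others.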

A clean way to handle interrupts is to associate a particular 
shared list with one priority of interrupt service and a specific 
bit in an interrupt-triggering control register. To request
interrupt service, an I/O device or processor adds a work item


Cache coherence (continued from p. 15)

how to organize it. One way is to keep somewhere in
memory a bit map for each cache line, setting a bit corresponding
to a particular cache when that cache comes to
the memory for the data. But in a large system, this would
require too many bits. Using thousands of bits to account
for the location of one cache line isn't acceptable. The
next refinement might be to make a list in memory, keeping
track of the identifiers of the caches that have taken
copies. But while most cache lines are shared by perhaps
only one or two caches, some might be shared by every
cache in the system. Thus the storage for the worst-case list
becomes impossibly large. Designers have used a variety
of clever compromises, such as allocating a small amount
of list storage and then assuming the worst when it runs
out. In that case, they assume that every cache in the system
has a copy and must be notified when the data changes.

SCI features a very general and scalable directory structure.
Memory controllers keep (and store in special memory)
one node pointer and a few state bits for each cache line
they have. That pointer points to the head of a list. The
address of the data in a read transaction routes the read to
the memory and tells the memory which cache line is desired.
The read transaction includes the node identifier of
the requester. The memory controller exchanges that iden¬ 
tifier with its pointer for that cache line, returning the old 
pointer value to the read requester. If the memory has a 
current copy of the data, it returns the data to the requester 
too. Otherwise the memory informs the requester that the 
data is somewhere else, in another cache, and it can use 
the returned pointer to find it. 
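
The memory controller's part in a read can be sketched as a pointer exchange. This is a simplified model: the "few state bits" are reduced to a single set of lines whose current copy lives in a cache, and the class and attribute names are hypothetical.

```python
class MemoryController:
    """Toy model of the per-line head pointer kept at memory."""

    def __init__(self, data):
        self.data = dict(data)   # line address -> contents
        self.head = {}           # line address -> node id at the list head
        self.stale = set()       # lines whose current copy is in some cache

    def read(self, address, requester_id):
        """Swap the requester in as the new head; return (old_head, data).

        old_head is None when no cache shared the line before. data is
        None when memory's copy is out of date, in which case the
        requester must follow old_head to find the current data.
        """
        old_head = self.head.get(address)
        self.head[address] = requester_id
        data = None if address in self.stale else self.data[address]
        return old_head, data


controller = MemoryController({0x40: "payload"})
first = controller.read(0x40, requester_id=7)    # no sharers yet
second = controller.read(0x40, requester_id=3)   # node 7 was the old head
controller.stale.add(0x40)                       # some cache now owns the data
third = controller.read(0x40, requester_id=9)    # must ask node 3 instead
```

Each request touches memory exactly once; whatever list maintenance follows happens between the requester and the node named by the returned pointer.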

SCI cache controllers keep in their tag memory storage, 
for each line, the memory address of the line (like all cache 


controllers have to do), a few state bits, and two pointers. 
The pointers form a doubly linked list of those nodes that 
have the particular line in their caches. 

There are two particularly good properties of this scheme.
First, it scales properly. No matter how many caches or
how much memory or what the sharing behavior happens
to be, exactly the right amount of storage is always available,
namely two pointers per cached copy plus one pointer
per cache line in memory.
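
The storage comparison can be worked through numerically. The sketch below contrasts a per-line bit map with the pointer scheme; it assumes a pointer needs just enough bits to name any node and uses two state bits as a placeholder for the unspecified "few state bits", so the absolute numbers are illustrative only.

```python
def pointer_bits(num_nodes):
    """Bits needed to name any one of num_nodes nodes."""
    return (num_nodes - 1).bit_length()


def bitmap_bits_per_line(num_caches):
    """Bit-map directory: one presence bit per cache, for every line."""
    return num_caches


def linked_list_bits(num_nodes, sharers, state_bits=2):
    """Pointer scheme: one pointer at memory plus two per cached copy."""
    at_memory = pointer_bits(num_nodes) + state_bits
    per_copy = 2 * pointer_bits(num_nodes) + state_bits
    return at_memory + sharers * per_copy


nodes = 4096
bitmap = bitmap_bits_per_line(nodes)              # 4,096 bits, shared or not
typical = linked_list_bits(nodes, sharers=2)      # 14 + 2 * 26 bits
unshared = linked_list_bits(nodes, sharers=0)     # only the memory pointer
```

The bit map costs the same for every line no matter how few caches share it, while the list grows only with actual sharing, and the per-copy pointers live in the caches that hold the copies.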

Second, maintaining this list is a distributed process. Only 
one transaction touches the memory for each request. All 
the rest (following pointers, adding itself to the list, remov¬ 
ing itself when its cache overflows and it needs to roll out 
a cache line to make more room) involve transactions 
among the distributed caches, which do not contribute to 
traffic congestion at the memory. 

Though the bus-snooping mechanism may seem con¬ 
ceptually simpler, keep in mind that it bears a high price. It 
prevents us from having thousands of transactions pro¬ 
ceeding at once. Furthermore, every cache in a snooping 
system must participate in every address cycle. Thus, ev¬ 
ery cache must be very fast so that participation does not 
slow the system too much, because the slowest one sets 
the speed for all. 

In the SCI scheme the cache controller acknowledges a 
packet’s arrival and checks it when it is convenient. If the 
controller is slow, it only slows the transactions in which it 
participates, not all transactions. As designers add new, 
higher performance caches to a system, they get a proportionate 
performance improvement. This is another desirable 
kind of scaling behavior; designers don't have to discard 
old cache controllers to get the benefit of new ones. 

Interrupts 


Interrupts are simple in a single-processor system, be¬ 
cause it is clear which processor should perform the re¬ 
quested service. 

Some simple systems just use one signal line to trigger 
an interrupt. The interrupt causes the processor to save its 
state on the stack or in a duplicate set of registers, then 
execute software that asks all possible interrupt sources 
whether they need service (interrupt-driven polling). 

More sophisticated systems allow the interrupting device 
to identify itself. Some kind of arbitration determines 
which source has the right to put its identifying vector on 
the bus during the interrupt-acknowledge cycle. The vector 
is generally converted into an address at which the corresponding 
service code starts in memory. The arbitration 
mechanism varies from system to system. It may use a 
central method, in which individual signal lines connect to 
each interrupt source, or a distributed method, using a 
daisy chain or some other arbitration mechanism. Often a 
combination of polling and vectoring is used. For example, 
the service routine might poll a list of devices known to be 
connected to interrupt line 3. 

These mechanisms are adequate for finding the source of 
an interrupt but do not deal at all with the problem of determining 
which of several processors should service it. 

Systems with multiple processors require a more general 
mechanism. The most common solution provides a 
special control register associated with the interrupt system 
of each processor. To generate an interrupt, a device 
must become the bus master and write to the interrupt 
register located at the address corresponding to the processor 
to be interrupted. Becoming a bus master and executing 
a write may require additional hardware in the 
interrupting device. So, some multiprocessor systems 
(Fastbus, IEEE Std 960) also provide a hybrid capability 
that allows a simple device to assert a signal that causes a 
shared intermediary device to (possibly poll and then) 
perform the write for it. 

It is tempting to design the interrupt registers to accept 
vector information directly. A device might write its assigned 
identifier into the register for use much like the 
vector used in simple interrupt systems. This temptation 
should be avoided, because there are hidden perils. For 
example, what happens if a second interrupt-write arrives 
soon after the first? Either it must be stored, implying a 
FIFO, or it has to be aborted and retried later. Any FIFO 
has a finite capacity, however, so under adverse conditions 
it might not be capable of holding any more vectors. 
But aborting the write and retrying also causes trouble, 
because it can lead to deadlocks. For example, suppose 
the interrupt service routine of one processor must send 
an interrupt to another, and the other processor has a service 
routine that has to interrupt the first. When both processors' 
FIFOs fill for some reason, deadlock ensues because 
neither can do anything except continue to retry sending 
the interrupt. 

We can argue that these conditions are anomalous and 
designing the hardware to have large enough FIFOs will 
make the problem unlikely to occur. However, experience 
shows that real multiprocessor systems seem to seek out 
these trouble spots. 

A clean solution puts the burden of allocating resources 
on the interrupt requester, so that it cannot ask for an 
interrupt unless it has the resources to ensure the interrupt 
request can be delivered. Suppose the requester allocates 
a block of memory to hold its vector and possibly other 
service information, links the block into a list of service 
requests, and then writes a pulse to the interrupt service 
register. Now, no possibility of being blocked exists, and 
no deadlock can occur. 

We can design the service-request list so that any number 
of interrupt requesters can concurrently add their blocks 
to the list while one server removes blocks to service them. 
We will not need semaphores or lock variables, if the system 
supports indivisible swap and compare-and-swap transactions.* 
These transaction types are extremely valuable in 
multiprocessor systems and deserve wide support by new 
processor architectures. 
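
A lock-free service-request list of this kind might look like the following C sketch, which uses C11 atomics as a stand-in for the bus-level swap and compare-and-swap transactions. The structure and function names are hypothetical, and the write to the interrupt register that would follow post_request is not shown.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Block allocated by the requester, so the server never runs out of room. */
typedef struct request {
    struct request *next;
    int vector;                  /* identifies the requester */
} request;

typedef struct {
    _Atomic(request *) head;
} request_list;

/* Requester side: link a block into the list with compare-and-swap. */
void post_request(request_list *list, request *r)
{
    request *old = atomic_load(&list->head);
    do {
        r->next = old;           /* 'old' is refreshed whenever the CAS fails */
    } while (!atomic_compare_exchange_weak(&list->head, &old, r));
}

/* Server side: detach the whole list with one indivisible swap. */
request *take_all(request_list *list)
{
    return atomic_exchange(&list->head, NULL);
}
```

The detached list comes back newest first; a server that must honor arrival order would reverse it before servicing the blocks.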


to the list describing what needs to be done, then sets the 
appropriate bit in the interrupt register. The item in the list is 
similar to the interrupt vector in single-processor systems. 
When the processor is ready to service that priority of interrupts, 
it takes work items off the list and services them. 

This mechanism is clean because the service requester allocates 
all storage, so there is no danger of the server running 
out of storage when the work piles up (a possible cause 
of deadlock). The interrupt bit is simple to implement because 
it is only a latch, with no FIFO storage or critical timing 
implied. Setting the bit merely alerts the processor that the 
list should be checked; the processor can clear the bit as 
soon as it commits to look at the list. (See the Interrupts box 
for more information.) 
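
The ordering implied here (add the item, then set the bit; clear the bit, then look at the list) can be sketched in C. This is a simplified model under stated assumptions: a counter stands in for the work list, and the variable and function names are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool interrupt_bit;    /* the latch: no FIFO behind it */
atomic_int  pending_items;    /* stand-in for the work list   */

/* Requester: add the work item first, then set the bit. */
void request_service(void)
{
    atomic_fetch_add(&pending_items, 1);
    atomic_store(&interrupt_bit, true);
}

/* Processor: clear the bit when committing to look, then drain the list.
 * Returns the number of items taken. */
int service_interrupt(void)
{
    if (!atomic_exchange(&interrupt_bit, false))
        return 0;                              /* nothing was signaled */
    return atomic_exchange(&pending_items, 0);
}
```

Because the bit is cleared before the list is read, an item that arrives during servicing either is picked up in this pass or sets the bit again for the next one.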

A common use of mutual exclusion is to grant sequential 
use of one resource to a series of requesters. This process can 
generate a large amount of useless interconnect traffic as processors 
keep updating their cached copies of the shared exclusion 
variable. SCI defines a queue-on-lock-bit, or QOLB, 
mechanism that uses the linked lists of the cache coherence 
system to pass the resource efficiently from one processor to 
the next. 

Statistics on memory sharing are controversial, because few 
relevant machines exist and because the usage patterns on 
any real machine will evolve to optimize performance on that 
architecture. However, most researchers agree that unshared 
data is the most common case, followed by pairwise shared, 
followed by multiply shared. Thus the pairwise sharing case 
may be an important one to optimize. SCI defines optional 
pairwise sharing optimizations that allow two processors to 
pass data back and forth without interacting with memory, 
thus distributing system traffic and reducing activity at the 
memory controllers. 

We made significant changes in the initialization model and 
in the executable C code that embodies the detailed specifications. 
In particular, we changed compile-time options to 
run-time options so that the same code could be used for testing 
the interaction of nodes implementing different option sets. 
Also, we significantly improved bit-serial link specifications 
using material supplied by the Serial HIPPI working group. 

The revision process converted all but one of the seven 
negative votes to affirmative by responding to the concerns 
expressed. As mentioned earlier, the one remaining negative 
vote can only be changed by converting the C code to Pascal, 
which would be unacceptable to the working group. 

We redistributed the resulting Draft 2.0 to the balloting body 
in December 1991 and expect that the standard will receive 
final IEEE approval early in 1992. Supporting chips should be 
available within a few months. Dolphin SCI Technology of 
Oslo, Norway, will offer single-chip SCI interfaces that incorporate 
transceivers, FIFOs, and cache coherence support. Several 
versions are planned, for use as a processor, a memory, or 
an I/O interface. Hewlett Packard is preparing parallel/serial 
converter chips to interface the Dolphin chips (parallel) to the 
1,000-Mbps serial encoding used on optical fiber or coaxial 
cable. These chips should be available early in 1992. 

Related standards projects 

The following projects are either used by SCI or form a 
part of ongoing work related to future SCI developments. 

IEEE Std 1212. The Control and Status Register Architecture 
standard defines the I/O architecture for SCI, Futurebus+ 
(IEEE Std 896.1 and 896.2-1991), and Serial Bus (P1394). 
David V. James, Apple Computer, 20525 Mariani Ave., 
Cupertino, CA 95014, phone 408-974-1321, fax 408-974-0781, 
dvj@apple.com, chaired the group. Voters approved the draft 
standard, which was then modified in response to ballot comments. 
The balloting body received a second and final 
recirculation, and the IEEE Standards Board granted final approval 
in December 1991. 

IEEE Std 1301. The Metric Equipment Practice for Microcomputers - 
Coordination Document is an approved standard 
(June 1991). This specification defines the generic metric 
modular packaging family used by the SCI module. Hans 
Karlsson, Ericsson Telecom AB, TN/ETX/T/F, Stockholm, S-126 25 
Sweden, phone +46-8-719-6037, fax +46-8-719-8282, 
chaired the group. 

IEEE Std 1301.1. The Detailed Standard for a Metric Equipment 
Practice for Microcomputers Using 2-mm Connectors and 
Convection Cooling is also approved (June 1991). This specification 
details the specific subset of IEEE Std 1301 that is used 
by the SCI module. Hans Karlsson served as chair. EIA IS-64, 
February 1991, presently defines the 2-mm connector, but it 
will become an IEC standard soon. (This connector family is 
sometimes called Metral, which is DuPont’s trademark.) 

P1394. The Serial Bus working group is developing a high-speed 
(10-20 Mbytes/s) serial bus that can be used for low-cost 
diagnostics (supporting IEEE Std 1149.1 Boundary Scan 
Architecture in large systems) and I/O. The group aims at a 
very low cost ($15 per connection, including cable, connector, 
and interface) bus for use in consumer products. SCI includes 
a Serial Bus connection in the module power connector. 
Michael Teener, Apple Computer, 3535 Monroe St., Santa 
Clara, CA 95051; phone 408-974-3521, fax 408-985-9893, 
teener@apple.com, chairs the group. Work on this standard 
should complete in 1992. 

P1596.1. The SCI/VME Bridge project is defining a bridge 
architecture for interfacing VMEbuses to an SCI node. This 
project provides I/O support for early SCI systems via VME. 
Products are likely to be available in 1992. Bjorn Solberg, 
CERN, CH-1211 Geneva 23, Switzerland, phone +41-22-767-2677, 
fax +41-22-782-1820, bsolberg@dsy-srv3.cern.ch, chairs 
the group. 

The project's main decisions involve the mechanism for 
mapping addresses between VME and SCI, which versions of 
VME to support, and how to handle interrupts, mutual exclu¬ 
sion, and cache coherence. 

P1596.2. The Cache Optimizations for Large Numbers of 
SCI Processors project is developing request combining, tree-structured 
coherence directories, and fast data distribution 
mechanisms needed for systems with thousands of processors, 
compatible with the base SCI coherence mechanism. 
Ross Johnson, Computer Science Dept., 1210 Dayton Street, 
University of Wisconsin, Madison, WI 53706, phone 608-262-6617, 
fax 608-262-9777, ross@cs.wisc.edu, chairs this group. 

This working group is developing and extending ideas 
that came up during the development of the base SCI standard, 
but which it felt could be postponed to avoid delaying 
SCI’s introduction. That is, the protocols defined by P1596 
seem adequate for systems with perhaps hundreds of processors 
(enough for a year or so) and include hooks for adding 
these optimizations later. 

When a large number of requests is addressed to the same 
node, the interconnect becomes congested and performance 
suffers. Certain kinds of requests, such as reads and fetch- 
and-adds, may be combined in the network to reduce this 
congestion. Request combining allows several requests to be 
combined into one when they meet (waiting in queues in 
the interconnect). All but one of these requests generate an 
immediate response, which tells the requester to get the data 
from that one's cache instead. 

SCI nodes handle this kind of no-data response already, 
because it is used in the basic coherence protocol. The re¬ 
maining request goes forward, eventually resulting in a re¬ 
sponse that provides the data to that cache, where the other 
nodes read them. This design spreads out the traffic, reducing 
congestion. 
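
The combining step can be illustrated with a small C sketch. It is a simplification under stated assumptions: all the types and names are hypothetical, the first matching read in the queue is arbitrarily chosen as the one that goes forward, and every other read of that address is answered at once with a pointer to that requester.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t address;
    uint16_t requester;
} read_request;

/* Immediate no-data response: "get the line from fetch_from's cache". */
typedef struct {
    uint16_t requester;
    uint16_t fetch_from;
} redirect;

/* Combine queued reads of 'address'.  *forwarded is set to the one request
 * that continues toward memory (NULL if none matched); the others produce
 * redirects written to 'out'.  Returns the number of redirects. */
size_t combine_reads(const read_request *queue, size_t n, uint64_t address,
                     const read_request **forwarded, redirect *out)
{
    size_t redirects = 0;
    *forwarded = NULL;
    for (size_t i = 0; i < n; i++) {
        if (queue[i].address != address)
            continue;
        if (*forwarded == NULL) {
            *forwarded = &queue[i];            /* this one goes forward */
        } else {
            out[redirects].requester  = queue[i].requester;
            out[redirects].fetch_from = (*forwarded)->requester;
            redirects++;
        }
    }
    return redirects;
}
```

Nothing about the combination is remembered once the redirects are sent, which mirrors the point that the interconnect retains no state.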

Note that these immediate responses relieve the intercon¬ 
nect from retaining any information about the combining, 
which enormously simplifies the process compared to previ¬ 
ous implementations. 

Once the data become available, the time needed to dis¬ 
tribute them to all the requesters becomes important. The 
linear linked lists of the base SCI standard result in times 
proportional to the number of requesters, which can be a 
performance problem in large systems. 

Instead of linear lists, the group would like to maintain a 
tree structure, which could distribute the data in time propor¬ 
tional to the logarithm of the number of requesters. 
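
To make the difference concrete, the following C sketch counts distribution steps for both shapes. It assumes an idealized binary tree in which each step doubles the number of nodes reached and every hop costs the same; neither assumption comes from the draft itself.

```c
/* Steps to reach n requesters along a linear list: one hop per requester. */
unsigned linear_steps(unsigned n)
{
    return n;
}

/* Steps to reach n requesters through a binary tree, level by level:
 * 1 node in the first step, 2 more in the second, 4 in the third, ... */
unsigned tree_steps(unsigned n)
{
    unsigned steps = 0, reached = 0, level = 1;
    while (reached < n) {
        reached += level;
        level *= 2;
        steps++;
    }
    return steps;
}
```

For 1,000 requesters this model gives 1,000 steps for the list but only 10 for the tree.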

At first, group members thought it would be impractical to 
maintain binary trees in the distributed coherence directory 
because the overhead would be too high. The schemes they 
had seen others use were not acceptable for SCI. These involved 
setting lock variables to get mutual exclusion while 
tree maintenance was done, thus scaling poorly and violat¬ 
ing an SCI design principle. (Lock variables also introduce a 
variety of complications, such as what to do when the pro¬ 
cess that holds the lock fails.) 

Thus the group first considered using approximate or tem¬ 
porary pointers, which would form short cuts along the lin¬ 
ear directory lists but gradually become inaccurate as 
processors rolled out cache lines and so on. Whenever a 
temporary pointer was used, it would be checked for validity, 
and the algorithm would drop back to following the (always 
valid) linear list when necessary. 

But at the August 1991 meeting, Ross Johnson presented a 
method for maintaining correct trees at all times, without 
using lock variables and without adding much overhead. 
Though details need to be worked out and some corner cases 
need more study, the group feels the remaining questions 
can be resolved. 

Open issues concern the worst case scenarios (when things 
happen in the worst possible sequence) and how to reduce 
the likelihood of having these occur. 

P1596.3. The Low-Voltage Differential Signals for SCI 
project specifies low-voltage differential signals suitable for 
high-speed communication between CMOS, GaAs, and 
BiCMOS logic arrays used to implement SCI. The object is to 
enable low-cost CMOS chips to be used for SCI implementations 
in workstations and personal computers, at speeds of at 
least 200 Mbytes/s. Gary Murdock, National Semiconductor, 
642 Pineview Drive, San Jose, CA 95117, phone 408-721-7269, 
fax 408-721-7218, chairs this group. 


Faster signaling requires smaller signals, if edge rates and 
currents are to be kept reasonable. Smaller signals require 
differential signaling (or at least their own reference inde¬ 
pendent of system ground). At first glance, differential signal¬ 
ing seems to cost a factor of two in signal traces and pins. But 
the real cost is much smaller because far fewer ground pins 
are needed, far less system noise is created (or picked up), 
and the higher signaling speeds reduce the number of paral¬ 
lel signals needed. 

SPICE modeling shows that we can signal at SCI speeds (2 
ns/bit/signal pair) with contemporary CMOS technology. 
MOSIS (a fast-turnaround prototype chip fabrication service) 
test chips have already reached nearly this performance. In 
fact, the hardest part of the problem is how to provide or 
accept the data at the signaling rate! 
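
The arithmetic behind these rates is simple enough to sketch in C. The function name is invented for the example, and the calculation counts only the data signal pairs, ignoring any clock or flag lines a real link would carry.

```c
/* Raw data rate of a parallel link in Mbytes/s.
 * One bit per nanosecond on one pair is 1,000 Mbit/s, or 125 Mbytes/s. */
unsigned link_mbytes_per_s(unsigned width_bits, unsigned ns_per_bit)
{
    return width_bits * 125u / ns_per_bit;
}
```

At 2 ns per bit, a 16-bit link moves 1,000 Mbytes/s and an 8-bit link 500 Mbytes/s.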

The group bases present modeling on a 250-mV voltage 
swing, centered on 1 V. This approach provides a little headroom 
for common-mode rejection at the receiver, while allowing 
use of 5-V, 3.3-V, and eventually 2-V technologies. 

The working group is choosing certain signal levels and 
rates to be supported as signal interchange (link) standards 
for CMOS implementations of SCI. It will also define an 8-bit 
and possibly a 4-bit link to complement the Std 1596-defined 
16-bit and 1-bit links. This work should complete in 1992. 

P1596.4. The High-Bandwidth Memory Chip Interface 
project defines an interface that will permit access to the 
large internal bandwidth available inside dynamic memory 
chips. The goal is to increase the performance and reduce 
the complexity of memory systems by using SCI signaling 
technology and a subset of the SCI protocols. This work was 
started by Hans Wiggers of Hewlett Packard Laboratories, 
and I now chair it. (My address appears at the end of this 
article.) 

A serious problem with present memory systems is the 
need to use a large number of memory chips in parallel banks 
to get the bandwidth needed for today’s powerful micropro¬ 
cessors. As the capacity per chip increases, the smallest 
memory configuration with adequate bandwidth reaches a 
point where it has an unreasonably large capacity (and need¬ 
lessly high cost). Furthermore, the increments for expansion 
are too large. We hope to get much higher bandwidth from 
far fewer chips by using SCI signaling technology. This ap¬ 
proach will lower the entry cost for low-end systems and 
raise the performance of high-end systems. 

Designers are considering several models.^ One of the most 
promising uses several RAM chip ringlets attached to a single 
controller by 8-bit-wide point-to-point links. The name RAM 
Link is becoming popular for this approach. Details of the 
signaling are still being worked out, but there seems to be 
general agreement to use small signal voltages. 

Other issues include whether to perform ECC (error checking 
and correction) in each RAM chip or in the controller. 
Including it in the RAM chip leaves the details up to the 
vendor, with possible future technology improvements be¬ 
ing incorporated transparently. Including it in the controller 
is probably the lowest initial cost and gives the system designer 
the most control. But ECC involves a performance 
penalty for transmitting the extra information on the links. 
We could increase link performance by using a 9-bit-wide 
link, but that would cost pins and power. 

Another open question concerns the link protocol to be 
used. If an SCI bypass FIFO can be incorporated on the 
RAM chips, we can use a simplified version of the present 
SCI protocols. If not, we must define a scheduling mechanism 
that prevents two packets from being transmitted at 
once. Predictability issues, such as the effects of refresh or 
ECC soft-error correction on the RAM chip, complicate scheduling. 
However, in the November 1991 meeting the group 
proposed a “designated token” scheduling mechanism that 
looked very promising. Draft 0.11 was presented at the January 
meeting. 

P1596.5. The Shared-Data Formats Optimized for SCI 
project specifies data formats for efficiently exchanging data 
between byte-addressable processors on SCI. SCI supports 
efficient data transfers between heterogeneous workstations 
within a distributed computing environment. Current systems 
require conversions among large numbers of vendor- or 
language-dependent data formats; specifying a single transfer 
format greatly reduces the complexity of this conversion 
problem. In addition to simplifying the data-interchange prob¬ 
lem, standard data formats provide a framework for the de¬ 
sign of future processor instruction sets and language data 
types. David V. James, Apple Computer, chairs this group. 

The specification defines integer and floating-point sizes, 
formats, and address-alignment constraints. It supports bit 
fields as subcomponents of a larger byte-addressable integer 
datum. Work on this project has just begun, but it should not 
take long to complete, since much of the groundwork has 
been done earlier in conjunction with development of the CSR 
Architecture.^ Draft 0.50 was presented at the January meeting. 

Future plans 

As technology advances, SCI will need to define new link 
standards. For example, fiber optic technology is 
currently expensive at 1,000 Mbps and prohibitive at higher 
rates. Yet this situation changes rapidly, and SCI applications 
in high-definition television would be greatly helped by a bit 
rate at least double this. We will monitor progress in this 
area. 

Similarly, SCI offers enormous application possibilities for 
slower links. An 8-bit-wide link operating at 250 Mbytes/s or 
500 Mbytes/s might be about right for the next-generation 
personal computers. Perhaps it will be appropriate soon to 
define a personal computer form factor and signal standard 
for SCI, based on new CMOS chips or processors with integrated 
SCI interfaces. 

While designing SCI, we learned a lot about how the pro¬ 
cessor should interact with the interconnect. We are consid¬ 
ering how best to spread this information to processor 
designers, to make multiprocessor systems more efficient. 
Possibly, this could be a recommended practice, or even a 
standard, clean, 64-bit RISC architecture optimized for use 
with SCI. 

We have in mind two projects for bridges to Futurebus+ 
that are currently waiting for the right time to start. One is a 
very simple bridge to Profile B, an I/O bus with no cache 
coherence. The other is a general symmetric bridge that includes 
cache coherence. 

The base SCI standard covering the physical signaling, 
logical protocols, and cache coherence mechanism 
should be approved by the IEEE Standards Board in March 
1992. The first commercial implementer (Dolphin SCI Technology, 
Oslo, Norway) expects to have working prototypes 
within a few months of the standard’s approval and has prom¬ 
ised to make the interface chips available to others. 

SCI’s performance seems such a large step ahead of the 
current state of the art in computer buses that it has some 
difficulty in appearing credible. The best answer to these 
doubts will be the existence of working silicon, available at 
reasonable prices, being used in working systems. 

The complexity of SCI is approximately the same as that of 
a split-cycle bus system (like the VAX BI or Futurebus+ or 
Fastbus with Buffered Interconnects) in small applications. It 
is much less than that of a bus system for large applications. 
Nevertheless, relatively few designers have experience with 
split-cycle bus design issues, and therefore we foresee some 
need for training as they move to SCI. 

SCI has no evident competition in terms of open systems 
that could hope to deal with the massive computation needs 
for the next generation of data acquisition, analysis, and 
general computation. Nor is there any competitor that spans 
SCI's whole application range: campuswide optical LAN, desktop 
workstation bus, network shared server, I/O interface, 
data acquisition system, highly parallel multiprocessor, 
supercomputer. 

For details, or to participate in this work, please contact 
me. 

Unix and the Am29000 Microprocessor 


Though targeted for use in medium- to high-performance embedded applications, the Am29000 
includes several design provisions allowing its use in Unix workstation applications. For 
example, a scalable interrupt-handling mechanism makes the processor particularly suited 
to real-time Unix operations. The relevant features discussed here will help users interested 
in building a Unix system. 


Daniel Mann 

Advanced Micro Devices 



Use of RISC processors in medium- and 
high-performance embedded applica¬ 
tions is growing rapidly. As prices fall, 
it seems likely that RISC machines will 
dominate CISC processors as the system of choice 
for an even wider range of new embedded sys¬ 
tem designs. A number of companies are already 
manufacturing processors using RISC principles 
that offer a better price-performance ratio than 
the top-end CISC devices. 

Many designers have chosen to use Advanced 
Micro Devices' Am29000 processor in complicated 
embedded applications. The increased use of 
higher level languages and real-time operating 
system support services with embedded RISC 
applications requires engineers to know more 
about processor features previously more widely 
used in workstation applications, such as Unix. 

Unix was developed on small computers and 
it makes modest demands on a host processor. 
However, for good performance Unix requires 
the processor to provide certain basic facilities. 
The Am29000 has several features that make it a 
particularly suitable Unix host. For example, the 
architecture is scalable yet has features for main¬ 
taining code compatibility. 

Our processor’s interrupt-handling mechanism 
services devices without incurring a built-in ex¬ 
ception-processing sequence. This is of particu¬ 
lar interest to implementors of Unix systems that 
are also constrained with real-time events. Users 
are free to design the necessary interrupt-handling 
processor environment for maximum efficiency. 

Among the Am29000’s features of particular 
interest to designers of high-performance Unix 
systems are 

• data and instruction caching, 

• memory management, 

• multiple data transfer instructions, 

• freeze-mode operation, 

• multiprocessor support, 

• floating-point support, and 

• flexible memory systems. 

A detailed discussion of the Am29000’s archi¬ 
tecture^'^ is beyond the scope of this article, as is 
a discussion of Unix internals.^'"^ Rather, I point 
out features that make our processor a good Unix 
host. 

C calling sequence 

Making a subroutine call on a processor with 
general-purpose registers is expensive in terms 
of time and resources. Because functions must 
compete for register use, registers must be saved 
and restored through register-to-memory and 
memory-to-register operations. For example, a C 
function call on Motorola’s MC68000 processor 
(see Table 1 on page 24) might use the statements 

char bits8; 

short bits16; 

printf("char= %c short=%d", bits8, bits16); 

After they are compiled, they generate the fol¬ 
lowing assembly-level code: 


0272-1732/92/0200-0023$03.00© 1992 IEEE 


To reduce future access delays, the 
system normally copies data to gen¬ 
eral-purpose registers before using it. 
For instance, using a memory-to- 
memory operation when moving data 
from the local frame of the function 
call stack would reduce the number of 
instructions executed. However, these 
are CISC instructions that require sev¬ 
eral machine cycles before completion. 

In the example, the C function call 
passes two variables, bits8 and bits16, 
to the library function printf( ). The fol¬ 
lowing assembly code shows part of 
the printf( ) function for the MC68000: 

_printf: 

LINK A6, #-32 ;local variable space 

LEA 8 [A6], A0 ;unstack string pointer 

UNLK A6 

RTS 

Several multicycle instructions are re¬ 
quired to pass the parameters and es¬ 
tablish the function context. Unlike the 
variable instruction format in the 
MC68000, the Am29000 has a fixed 32- 
bit instruction format (see Figure 1). 
The same C statements compiled for 
the Am29000 (see Table 2) generate 
the following assembly code for pass¬ 
ing the parameters and establishing the 
function context: 


MC68000 calling sequence: 

MOVE.W -4 [A6], D0 ;stack bits16 variable 
EXT.L D0 
MOVE.L D0, -[A7] 
MOVE.B -1 [A6], D0 ;stack bits8 variable 
EXTB.L D0 
MOVE.L D0, -[A7] 
PEA L15 ;stack text string ptr 
JSR _printf 
LEA 12 [A7], A7 ;repair stack pointer 

Am29000 calling sequence: 

L1: .ascii "bits8=%c bits16=%d" 
const lr2,L1 
consth lr2,L1 
add lr3,lr6,0 ;move bits8 and bits16 
add lr4,lr8,0 ;to bottom of the 
;activation record 
call lr0,_printf ;return addr in lr0 





Table 1. MC68000 instructions. 

Instruction 

Comment 

MOVE.W saddr,daddr 

Move 16 bits of data from saddr to daddr. 

MOVE.W -4 [A6], D0 

Source address is register indirect with displacement. 
Destination address is data register 
direct. The word at memory location -4 
relative to the current frame pointer (A6) is 
copied into data register D0. 

MOVE.B saddr, daddr 

Move 8 bits of data from saddr to daddr. 

MOVE.L saddr, daddr 

Move 32 bits of data from saddr to daddr. 

EXT.L data_register 

Extend the sign of 16-bit data to 32-bit register 
size. 

PEA L15 

Push address L15 onto the stack (A7). 

JSR _printf 

A jump to subroutine _printf is taken. The current 
PC value is first pushed onto the stack (A7). 

LEA 8 [A6], A0 

The address of the data object located 8 bytes 
above the current frame pointer (A6) is loaded 
into address register A0. 

LINK A6,#-32 

Push the frame pointer (A6) onto the stack. Then 
copy the stack pointer (A7) to the frame 
pointer (A6). The stack pointer (A7) is then 
lowered by 32 bytes. This instruction takes 
several cycles and is used in a procedure 
prologue. 

UNLK A6 

The frame pointer (A6) is copied to the stack 
pointer. A new frame pointer is then popped 
out of the stack. This instruction takes several 
cycles and is used in a procedure epilogue. 

RTS 

The address value for a subroutine return is 
popped out of the stack (A7) and loaded into 
the PC. 


L15: .ascii "char= %c short=%d" 


This assembly listing shows how parameters pass via the 
stack to the function being called. 

The LINK instruction copies the stack pointer A7 to the 
local frame pointer A6 upon entry to a routine. The param¬ 
eters passed and local variables in memory are referenced 
relative to register A6. 


Op code    Operand C    Operand A    Operand B 
8 bits     8 bits       8 bits       8 bits 

Figure 1. Am29000 fixed 32-bit instruction format. 

Table 2. Am29000 instructions. 

Instruction
    Comment

const reg, value
    The value is placed in the selected register. This is a
    data-direct instruction. Operand fields A and B hold
    the 16-bit constant.

const lr2, L1
    The lower 16 bits of address L1 are placed in local
    register lr2. The high 16 bits of lr2 are cleared.

consth lr2, L1
    The high 16 bits of address L1 are placed in the high
    16 bits of local register lr2.

add des, srcA, srcB
    The two source operands A and B are added, and the
    result is placed in the destination register.

add lr3, lr6, 0
    Zero is added to the lr6 value, and the result is placed
    in local register lr3. This is effectively a register-to-register
    move operation.

call lr0, _printf
    A jump to subroutine _printf occurs in the cycle
    following the current cycle. The current PC address
    plus 8 is placed in local register lr0. This is
    effectively a subroutine call with the destination
    address obtained by adding the current PC value
    and the 16-bit value formed by operand fields A
    and B. The Am29000 implements delay-slot
    branching; thus it always executes the instruction
    following a branch instruction. Direct 32-bit address
    procedure calls are supported with the CALLI
    instruction.

jmpi lr0
    The value in local register lr0 is loaded in the PC.
    Execution at the destination of a jump commences
    after the instruction following the JMPI has
    executed. This instruction implements a subroutine
    return.

asgeu V_SPILL, gr1, rab
    This is a conditional trap instruction. Such instructions are
    used in procedure prologues and epilogues to check
    whether cache spilling or filling is required. Trap number
    V_SPILL is taken if the assertion that the global register
    gr1 is greater than or equal to global register rab is false.
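
The const/consth pair in Table 2 builds a full 32-bit address in two steps. The following sketch models that behavior on plain integers; the function names mirror the mnemonics but are otherwise hypothetical, and the sample address is arbitrary.

```python
MASK32 = 0xFFFFFFFF

def const(value16):
    """Model of const: load a 16-bit constant; the high 16 bits are cleared."""
    return value16 & 0xFFFF

def consth(reg, value16):
    """Model of consth: set the high 16 bits, keeping the low 16 bits of reg."""
    return (((value16 & 0xFFFF) << 16) | (reg & 0xFFFF)) & MASK32

def load_address(address):
    """Build a 32-bit address with a const/consth pair, as in Table 2."""
    reg = const(address & 0xFFFF)      # low half first
    reg = consth(reg, address >> 16)   # then the high half
    return reg


lr2 = load_address(0x80041A2C)   # arbitrary example address
```

Two fixed-width instructions are needed because a single 32-bit instruction cannot carry a 32-bit constant alongside its op code.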


Register stack. We define a register 
stack in an assigned area of memory to
pass the parameters and allocate working 
registers to each procedure. The register 
cache replaces the top part of the register 
stack, as shown in Figure 2 on page 26. 

The global registers rab and rfb point 
to the top and bottom of the register 
cache. Global register rsp (also known 
as gr1) points to the top of the register
stack. The register cache, or stack win¬ 
dow, moves up and down the register 
stack as the stack grows and shrinks. Use 
of the register cache allows data to be 
accessed through local registers at high 
speed. On-chip triple-porting (two read 
ports and one write port) enables the 
register stack to perform better than a 
data memory cache, which cannot read 
and write in the same cycle. 

Activation record. Our processor 
does not apply push or pop instructions
to external memory. Instead, each func¬ 
tion is allocated an activation record in 
the register cache at compile time. Acti¬ 
vation records hold local variables and 
parameters passed to the function. 

The caller stores its outgoing argu¬ 
ments at the bottom of the activation 
record. The called function establishes a 
new activation record below the caller's 
record. The top of the new record overlaps
the bottom of the old record, so the
outgoing parameters of the calling func¬ 
tion are visible within the called 
function's activation record. 

Although the activation record can be 
any size within the limits of the physical 
cache, the compiler will not allocate 
more than 16 registers to the parameter-passing
part of the activation record. Functions that cannot
pass all their outgoing parameters in registers must use a
memory stack for additional parameters (global register msp
points to the top of the memory stack). This happens infrequently,
but it is required for the parameters that have their
address taken. Data parameters at known addresses cannot 
be supported in register address space because data addresses 
always refer to memory and not to registers. 

The following code shows part of the printf( ) function for
the Am29000 processor: 

_printf:
        sub     gr1, gr1, 16        ;function prologue
        asgeu   V_SPILL, gr1, rab   ;compare with top of window; rab is gr126
        add     lr1, gr1, 36

        jmpi    lr0                 ;return
        asleu   V_FILL, lr1, rfb    ;compare with bottom of window, gr127


The register stack pointer rsp points to the bottom of the
current function’s activation record. All local registers are referenced
relative to rsp. Four new registers are required to support
the function call shown, so rsp is decremented 16 bytes.

[Figure 2 diagram. Recoverable labels: higher address, lower
address, register stack, memory-resident portion of stack,
register cache (register window), empty, external memory.
rsp points to the top of the stack; rab points to the top of
the cache register window.]

* Since the stack grows down, the “bottom” (older) activation records
are located above the “top” (newer) activation records.

Figure 2. Register stack window.


Rsp performs a role similar to the MC68000’s A7 and A6 
registers except that it points to data in high-speed registers, 
not data in external memory. 
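
The overlap between the caller's and the callee's activation records can be sketched with a small model. It assumes that local register lrN lives at register-stack address gr1 + 4N and follows the prologue shown earlier (`sub gr1, gr1, 16`, `add lr1, gr1, 36`); the class name, the starting address, and the argument values are hypothetical.

```python
WORD = 4  # bytes per 32-bit register


class RegisterStack:
    """Toy register stack: local register lrN sits at address gr1 + 4*N."""

    def __init__(self, gr1):
        self.gr1 = gr1       # register stack pointer (rsp)
        self.cells = {}      # register-stack address -> value

    def addr(self, n):
        return self.gr1 + WORD * n

    def write_lr(self, n, value):
        self.cells[self.addr(n)] = value

    def read_lr(self, n):
        return self.cells[self.addr(n)]

    def prologue(self, new_words):
        """Lower gr1 to allocate new registers below the caller's record."""
        self.gr1 -= WORD * new_words


def spill_needed(gr1, rab):
    """asgeu V_SPILL, gr1, rab traps when 'gr1 >= rab' is false."""
    return gr1 < rab


stack = RegisterStack(gr1=0x8000)            # arbitrary starting address
for n, arg in zip((2, 3, 4), ("fmt", "c", "s")):
    stack.write_lr(n, arg)                   # caller's outgoing arguments
stack.prologue(4)                            # sub gr1, gr1, 16
incoming = [stack.read_lr(n) for n in (6, 7, 8)]
frame_pointer = stack.gr1 + 36               # add lr1, gr1, 36
```

No data is copied when the call is made: the callee simply sees the same physical registers under register numbers that are four higher.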

The compiler reserves local registers lr1 and lr0 for special
duties within each activation record. Lr0 contains the address
at which execution resumes on return to the caller’s activation
record. Lr1 points to the top of the caller’s activation record.
The new frame allocates local registers lr2 and lr3 to hold
printf function local variables.

As Figure 3 shows, the positions of five registers overlap.
The three printf( ) parameters enter from lr2, lr3, and lr4 of
the caller’s activation record and appear as lr6, lr7, and lr8 of
the printf( ) function activation record.

lr8   Incoming parameter    (top of activation record)
lr7   Incoming parameter
lr6   Incoming parameter
lr5   Frame pointer
lr4   Return address        (base of caller’s activation record;
                             gr1 before _printf is called)
lr3   Local
lr2   Local
lr1   Frame pointer
lr0                         (base of _printf activation record;
                             gr1 (rsp) when _printf executes*)

The _printf activation record is 9 words.

* gr1 is lowered 4 words (16 bytes) in
the prologue of _printf.

Figure 3. Overlapping registers.

Spill and fill. If not enough registers are available in the
cache when it moves down the register stack, then a V_SPILL
trap is taken, and the registers spill out of cache into memory.
Only procedure calls that require more registers than are
currently available in the cache suffer this overhead.

Once a spill occurs, a fill (V_FILL trap) can be expected at
a later time. The fill doesn’t happen when the function call
causing the spill returns, but rather when some earlier function
that requires data held in a previous activation record
(just below the cache window) returns. Just before a function
returns, the lr1 register, which points to the top of the caller’s
activation record, is compared with the pointer to the bottom
of the cache window (rfb). If the activation record is not
stored completely in the cache, then the fill overhead occurs.

Performance. The register stack improves the performance
of call operations because most calls proceed without
memory access. The register cache contains 128 registers,
so very few function calls or returns require register
spilling or filling.

Because most of the data required by a function reside in
local registers, there is no need for elaborate memory-addressing
modes, which increase access latency. The function call
overhead in the Am29000 consists of a small number of
single-cycle instructions; the MC68000 requires a greater
number of multicycle instructions.

Context switching

Context switching occurs when one process gives up control
of the CPU without terminating, and another process,
which had previously given up control, resumes executing.
When this happens, the state of the processor being used by
the process (the context) must be saved, and the context of
the other process must be restored. Other than allowing
multiple processes to share the same CPU, this saving and
restoring of context performs no useful work. If context
switching is performed frequently, it should be a
low-overhead operation.

One result of having a large register file is an increased 
context-switching time. Approximately half of the 128 local 
registers and about 32 of the global registers contain data for 
the user’s process. About 96 memory writes and 44 memory 
reads are required to store and reload these registers with a 
new context. However, with single-cycle, burst-mode access,
the register save and restore time can be as short as 5.6 µs.
Note that the context reloading for synchronously saved pro¬ 
cess states is faster than context saving because only the acti¬ 
vation record of the currently active procedure need be 
restored. The number of local registers requiring restoring de¬ 
creases from approximately 64 to, typically, 12. 

Memory access time dominates the total context switch time. 
In a low-cost system with no external data cache, this is most 
apparent. When using five-cycle read/write memory, the 
memory access time is 28 µs. In practice, any increase in
context-switching time due to the large register set is more than
offset by the savings in normal execution. This is because
context switches (30 to 60/s) are less frequent than system
calls (150 to 250/s) or subroutine calls (about 350,000/s). The
Am29000 can switch context in as little as 10 µs, about 45
times faster than the VAX 11/780.
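
The register save and restore figures quoted above follow from simple arithmetic. The sketch below reproduces them under the assumption of a 25-MHz clock (a 40-ns cycle, consistent with the clock rate quoted elsewhere in this article); the function and variable names are illustrative only.

```python
CLOCK_HZ = 25_000_000           # assumption: 25-MHz processor clock
CYCLE_NS = 1e9 / CLOCK_HZ       # 40 ns per cycle


def transfer_time_us(accesses, cycles_per_access):
    """Time in microseconds for a number of memory accesses."""
    return accesses * cycles_per_access * CYCLE_NS / 1000.0


ACCESSES = 96 + 44                           # memory writes plus reads
burst_us = transfer_time_us(ACCESSES, 1)     # single-cycle burst mode
slow_us = transfer_time_us(ACCESSES, 5)      # five-cycle memory

# Share of each second spent moving registers at 60 context switches/s
# with five-cycle memory.
cpu_share = 60 * slow_us / 1_000_000
```

Even with slow memory and the highest quoted switch rate, the register traffic accounts for well under one percent of processor time in this estimate.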

System calls 

System calls are trusted library functions that execute with
kernel mode permissions to access resources otherwise
unavailable to the user. To users, a system call looks like a
normal C function call. But when a user program executes a
system call, it leaves user mode and enters kernel mode.

Permission levels. When a system call is made, the selected
function executes a trap instruction to change the processor
status. After the trap is taken, the process operates with
kernel mode permissions. It is usually necessary for the
operating system to copy the calling parameters from the user mode
register stack to the kernel data space, because system call
functions receive their parameters from kernel data memory
space. To enable system functions to appear as ordinary C
functions, the register stack parameters are transferred into a
kernel data structure known as the upage by a system call
dispatcher routine.

System calls, despite their kernel mode permissions, must not
access data that the user would not normally be able to access.
When moving data between user space and kernel space,
many processors use assembly level instructions to manipulate
the memory management unit (MMU) or other protection
hardware. This allows the kernel to access the user data space
with the user’s permissions rather than the kernel’s permissions.

The usual set of Unix functions available includes a fetch 
user byte, fubyte(a), and a store user byte, subyte(a,v). As an 


example, the statement 

result = fubyte(addr); 

fetches a byte of data from user address addr and stores it in 
the variable result. This variable is in kernel data space, and 
the access of the user data space occurs with the user’s per¬ 
missions. The subyte(a,v) function moves data in the opposite 
direction, from kernel space to user space. This routine is required
to modify data structures in the user’s address space.

Protection boundaries. Our processor efficiently copies 
data across protection boundaries in a Unix implementation. 
System call parameters are not located in the user’s data 
memory space but in the register cache, where they don’t
result in a heavy overhead. 

After the system call trap is taken, call parameters can be
copied from the registers into the upage without any permission
problems because the register cache is wholly owned by
the user process. There is no risk of the kernel accessing data
belonging to another user. The store multiple instruction
moves data quickly from registers to memory.

The processor handles system call return values the same
way as any other function call return values. Global registers
gr96 through gr111 hold the first 16 words of a function
return value. Any other data in the user’s data space that the
system call uses or updates can be accessed by setting the user
access (UA) bit in load and store (LOAD, STORE) instructions.
The UA bit allows programs executing in kernel mode to emulate
user mode access. Doing so enables functions such as
subyte( ) to be implemented inexpensively, particularly since
the UA bit can be used with the load and store multiple
(LOADM, STOREM) instructions.
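
The effect of accessing user data with the user's permissions can be sketched as follows. This is a software model, not the Am29000 mechanism: the `UserSpace` class, the 4-Kbyte page size, and the return conventions (-1 on a fault, 0 on a successful store, following the traditional Unix kernel convention as assumed here) are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class UserSpace:
    """Hypothetical user address space: page number -> (data, writable)."""

    def __init__(self):
        self.pages = {}

    def map_page(self, page, writable=True):
        self.pages[page] = (bytearray(PAGE_SIZE), writable)


def fubyte(space, addr):
    """Fetch a user byte; return -1 if the user could not read addr."""
    page, offset = divmod(addr, PAGE_SIZE)
    entry = space.pages.get(page)
    if entry is None:
        return -1
    return entry[0][offset]


def subyte(space, addr, value):
    """Store a user byte; return 0 on success, -1 if the user could not write."""
    page, offset = divmod(addr, PAGE_SIZE)
    entry = space.pages.get(page)
    if entry is None or not entry[1]:
        return -1
    entry[0][offset] = value & 0xFF
    return 0


user = UserSpace()
user.map_page(2, writable=True)
user.map_page(3, writable=False)
status = subyte(user, 2 * PAGE_SIZE + 5, 0x41)
result = fubyte(user, 2 * PAGE_SIZE + 5)
```

The point of the model is that the kernel routine fails exactly where the user's own access would fail, so a system call cannot be used to reach memory the user is not entitled to.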

Overhead. Because few accesses to data memory are required,
the overhead of switching from user mode to kernel
mode is small. In kernel mode, separate memory and register
stacks are used, but they are still accessed via global registers
rab, rfb, msp, and rsp. Upon entering kernel mode, these registers
contain user mode values, which are then stored in the
upage (part of the context save) and replaced with addresses
of kernel-mode memory and register stacks. The register stack
is not pushed to memory, but rather the kernel register stack is
added to the end of the cache.

If the barriers between user and kernel modes are high and 
secure, the computation required for a mode change may far 
exceed the cost of the actual requested operation. In cases of 
extremely poor design, the simplest system calls may require 
hundreds or even thousands of overhead instructions. 

In keeping with RISC concepts, we structured the system
call operation to ensure the minimum overheads at each stage.
As an example of the results achieved, getpid( ) executes in
about 10 µs, depending on the memory architecture. The same
function call has been measured at 400 µs on a VAX 11/780.
The system call mechanism results in an overhead that is one
hundredth of that incurred by a VAX 11/780.

Interrupt handling 

As a RISC processor, the Am29000 limits the mechanism for
handling interrupts and traps so system designers need not be
concerned about when, where, or how much state is saved.
State saving is not built into the processor interrupt mechanism
but is left for the programmer to implement. Interrupts
use a vector table to select the required service routine. After
nine cycles (0.36 µs) or less, the processor executes the service
code. A freeze mode makes this possible.

Freeze mode. When an interrupt or trap occurs, the pro¬ 
cessor enters freeze mode. The hardware execution context 
described by critical CPU registers is not saved, but the state of 
these registers is frozen and cannot be updated until an inter¬ 
rupt return is taken or freeze mode is turned off. Interrupts 
that do not require the use of processor special registers can 
be serviced in freeze mode. This allows a fast interrupt service
response. In fact, it is practical to implement a software cache- 
controlled reload in low-cost systems. 

Certain critical registers are required for the execution of C 
code routines. If the interrupt-handling code is just assembly
glue used to reach a C code routine, then the critical registers 
must be saved explicitly before freeze mode is turned off. 
Users can tailor the form of the register context stacking pro¬ 
cedure to meet special needs. This facility can prove useful in 
optimizing system operation. 

Interrupt servicing. Our processor’s high-speed operation
enables interrupts to be handled by a Unix implementation
without assigning a priority order. When an interrupt is
detected, it will be serviced immediately if no other interrupts
are currently being serviced. Otherwise, the interrupt is added
to the end of a queue. The processor continually tries to empty
this queue and return to executing user processes.
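
The queueing policy just described (service immediately if idle, otherwise append to a first-in, first-out queue) can be sketched as below. The class name `InterruptQueue`, the handler table, and the interrupt source names are hypothetical and serve only to exercise the policy.

```python
from collections import deque


class InterruptQueue:
    """Service interrupts in arrival order, with no priority levels."""

    def __init__(self, handlers):
        self.handlers = handlers   # source name -> callable(queue)
        self.pending = deque()
        self.busy = False
        self.serviced = []         # order in which sources were serviced

    def interrupt(self, source):
        if self.busy:
            self.pending.append(source)   # already servicing: queue it
            return
        self.busy = True
        self.pending.append(source)
        while self.pending:               # drain the queue before returning
            current = self.pending.popleft()
            self.serviced.append(current)
            self.handlers[current](self)
        self.busy = False                 # back to user processes


def disk_handler(queue):
    # Two more interrupts arrive while the disk interrupt is being serviced.
    queue.interrupt("timer")
    queue.interrupt("serial")


handlers = {
    "disk": disk_handler,
    "timer": lambda queue: None,
    "serial": lambda queue: None,
}
irq = InterruptQueue(handlers)
irq.interrupt("disk")
```

Interrupts that arrive during service are handled strictly in arrival order once the current handler finishes.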

When the first interrupt occurs, possibly initiating a queue, 
the processor may be in user mode or kernel mode. Although 
each process executes the same kernel code, each process has 
its own kernel stack for kernel mode function calls and data. 
In user mode, the kernel memory stack stores data during the 
interrupt service routines. If the interrupt occurs in kernel
mode, then a separate shared interrupt stack is used.

Although I have described only one stack, the processor 
contains two: one for conventional memory data and one for
register data that has spilled out or swapped out of the register 
cache. 

Virtual address space 

Unix is a multiprocess system, with several processes ex¬ 
isting in different areas of system memory at the same time. 
Programs may be loaded at some memory location and then 
change location during execution due to being swapped out 
and back in again. Programs are usually intended to execute 
from virtual address zero, but the actual physical memory


used by the program is dependent on the relocation function 
of the MMU. The program, or parts of the program, may get 
swapped to new physical locations. But because of the ad¬ 
dress translation of the MMU, they always appear to be at the 
same virtual address. 

On-chip MMU. We built a standard MMU into the Am29000 
chip to lower system cost and maintain compatibility among 
29000 users. Most other processors require expensive exter¬ 
nal hardware to translate addresses. In such cases, designers 
sometimes prefer to develop their own MMUs. This can lead 
to incompatibilities with code developed for other designs. 

Translations. The Am29000 uses a 64-entry translation 
look-aside buffer (TLB) to perform address translations. The 
TLB translates the most frequently used instruction and data
memory pages and reflects the information contained in more 
extensive tables in memory. Data addresses are translated 
during each execute processor stage. Instruction addresses 
are translated whenever a jump is taken, achieving virtual 
memory support without any access delays.

Because of the TLB's limited size, there is a chance that an 
access will be requested for a memory page not currently
translated (a TLB miss). When this occurs, the special register
lru (least-recently used TLB entry) selects a TLB register for
replacement with new translation data. 

The TLB registers are not updated automatically by hard¬ 
ware. When a TLB miss occurs, a trap to a software routine 
that maintains TLB entries is performed. Using this method to 
update the TLB allows a variety of memory management 
techniques to be implemented. Systems that rely on hard¬ 
ware to update the TLB registers do not achieve this level of 
flexibility. The technique is particularly well suited to the 
emulation of missing hardware accessed in virtual address 
space. A TLB register can be updated in two Am29000 cycles 
(one for each word of a TLB entry). This keeps down the 
software overhead of reloading the TLB. Depending on page-table
layout, a TLB reload needs 25 to 40 cycles after a page
miss trap occurs.
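
A software-managed TLB of the kind described above can be modeled in a few lines. This sketch is deliberately simplified: it treats the 64 entries as fully associative with exact least-recently-used replacement, which is an assumption and not a description of the Am29000's internal TLB organization. The class name, the 4-Kbyte page size, and the page table contents are likewise hypothetical.

```python
from collections import OrderedDict

TLB_ENTRIES = 64
PAGE_SIZE = 4096   # assumed page size


class SoftwareTLB:
    """TLB whose entries are loaded by a software miss handler."""

    def __init__(self, page_table):
        self.page_table = page_table    # virtual page -> physical frame
        self.entries = OrderedDict()    # least recently used entry first
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.entries:
            self._miss_trap(vpage)      # trap to the software handler
        self.entries.move_to_end(vpage) # mark as most recently used
        return self.entries[vpage] * PAGE_SIZE + offset

    def _miss_trap(self, vpage):
        """Software handler: pick a victim and load the new translation."""
        self.misses += 1
        if len(self.entries) >= TLB_ENTRIES:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[vpage] = self.page_table[vpage]


page_table = {v: v + 100 for v in range(70)}   # made-up mapping
tlb = SoftwareTLB(page_table)
paddr = tlb.translate(3 * PAGE_SIZE + 8)
```

Because the handler is ordinary software, the replacement policy and page table layout can be changed without touching the hardware, which is the flexibility the text refers to.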

Memory access protection 

In any multiprogramming system, especially one in which 
users are developing, debugging, and testing software, pro¬ 
cess a should be forbidden access to the data of process b if 
accessing that data would impair process b. In addition, some
operating system data should be protected even from read¬ 
only access. 

Reducing memory usage and permitting efficient 
multiprocess cooperation requires controlled sharing of re¬ 
gions of memory and other resources. For many programs, a 
large portion of executable code consists of the system-provided
library subroutines. For small programs, the library code
used by the program can be larger than the program itself,
particularly if the program makes heavy use of standard I/O
routines. Enhanced Unix systems provide shared libraries so
only one copy of the library code is required for all the processes
in the system.

Protection violation checking. The TLB hardware checks
for protection violations. Each TLB entry contains a 6-bit field
that controls access to the associated page, which is assigned
to a particular user through a task identifier. Permissions can
be set to separately enable reading, writing, and execution of
page data. The processor mode and user ID are checked
against the access permissions for the virtual memory page
accessed. A software trap occurs if access is not permitted.
Including user identifiers in each TLB entry enables Unix to
switch processes without clearing all TLB entries. Performance
is improved because valid TLB entries are likely to remain
and be reused when their associated process is swapped
back in.
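
The permission check can be sketched as follows. The bit layout is an assumption made for this example (three bits for kernel-mode read, write, and execute, and three for user mode); the text above states only that the field is 6 bits wide and that mode and task identifier are both checked. The names `TLBEntry` and `access_permitted` are hypothetical.

```python
from dataclasses import dataclass

# Assumed layout of the 6-bit field: kernel (supervisor) read, write,
# execute, then user read, write, execute.
SR, SW, SE, UR, UW, UE = (1 << i for i in range(5, -1, -1))

PERMISSION_BITS = {
    "read": (SR, UR),
    "write": (SW, UW),
    "execute": (SE, UE),
}


@dataclass
class TLBEntry:
    vpage: int
    frame: int
    tid: int      # task identifier of the owning process
    perms: int    # 6-bit permission field


def access_permitted(entry, tid, kernel_mode, kind):
    """Return True if the access may proceed; False models the trap case."""
    if entry.tid != tid:
        return False                  # entry belongs to another task
    kernel_bit, user_bit = PERMISSION_BITS[kind]
    needed = kernel_bit if kernel_mode else user_bit
    return bool(entry.perms & needed)


entry = TLBEntry(vpage=5, frame=42, tid=7, perms=SR | SW | UR)
user_read = access_permitted(entry, tid=7, kernel_mode=False, kind="read")
user_write = access_permitted(entry, tid=7, kernel_mode=False, kind="write")
```

Tagging each entry with a task identifier is what lets entries for several processes coexist in the TLB across context switches.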

Protection control. The Am29000 offers considerable 
access control within the chip. Other systems that use exter¬ 
nal hardware to achieve a similar function incur a greater 
system cost. Also, by including the necessary hardware in 
our design, we achieve a standard that cannot be ensured 
when designers build their own protection systems. 

Cache support 

The Am29000 incorporates an on-chip branch target cache 
(BTC) memory. If the target instructions of a branch are found 
in the cache, the branch executes in one cycle, causing no 
stalling of the processor pipeline. For sequential instruction 
access, an instruction prefetch buffer ensures that instructions
are fetched up to four cycles ahead of their execution,
enabling single-cycle instruction execution at high speeds. 

The BTC memory contains the target instruction sequence 
for 32 branches. Simulation results show this is sufficient for 
60 percent of the branch instructions. If a BTC memory miss
occurs, the hardware automatically updates the cache with
the new target instruction sequence. In such cases, the execution
pipeline stalls for one cycle plus the number of cycles
required to access the instruction memory. Having an on-chip
instruction cache enables users to implement low-cost
systems that achieve high speeds.
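
The average cost of a taken branch follows directly from the hit rate and the miss penalty given above. In the sketch below, the 60 percent hit rate comes from the text, while the two-cycle instruction memory access time is an assumed value chosen only to make the example concrete.

```python
def average_branch_stall(hit_rate, memory_access_cycles):
    """Expected stall cycles per branch for a branch target cache.

    A hit costs no stall; a miss stalls for one cycle plus the
    instruction memory access time.
    """
    miss_penalty = 1 + memory_access_cycles
    return (1.0 - hit_rate) * miss_penalty


stall = average_branch_stall(hit_rate=0.60, memory_access_cycles=2)
```

With these numbers a branch costs a little over one stall cycle on average; slower instruction memory raises the figure in proportion.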

Data caching. The Am29000 processor caches data via a 
large register file to reduce the number of memory loads and 
stores. This reduces the requirement for an external data cache. 
Of the 192 general-purpose registers, 128 serve as a register 
cache. (See Figure 4.) This cache stores data variables and
function call parameters by using an overlapping stack frame 
technique. 

On occasion the cache becomes full and some data spills 
out to memory. Simulation results of programs such as nroff 
show this is necessary in only 0.1 percent to 0.5 percent of all 
calls. Since calls typically constitute 1.2 percent of all instructions
used, and spilling out and filling back typically require
only 25 to 30 cycles, the overhead is low. 
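
The claim that the overhead is low can be checked with the figures just quoted. The sketch assumes that the 25 to 30 cycles apply once per call that spills; the function name is illustrative.

```python
def spill_cycles_per_instruction(call_fraction, spill_fraction, cycles_per_spill):
    """Average spill/fill cycles added to each executed instruction."""
    return call_fraction * spill_fraction * cycles_per_spill


# Calls are 1.2 percent of instructions; 0.1 to 0.5 percent of calls spill.
best_case = spill_cycles_per_instruction(0.012, 0.001, 25)
worst_case = spill_cycles_per_instruction(0.012, 0.005, 30)
```

Under these assumptions the cost ranges from about 0.0003 to 0.0018 cycles per instruction, a small fraction of one percent for a processor executing close to one instruction per cycle.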

Accessing data in the register file does not suffer the delays 


encountered in accessing external data, even if a high-speed 
cache is used. Thus the Am29000 processor is well suited for 
even higher speed devices in future technologies. The register
file has a further advantage in that it is triple-ported. It
accesses two sources in one cycle, while a previously computed
result is written back. Subsequent accesses take one
cycle, using burst mode.

On-chip instruction caching. BTC memory caches user 
mode and kernel mode instructions. The cache tag logic works
with physical addresses, possibly after they have gone through 
the virtual-to-physical address translation service of the TLB. 
The BTC memory entries need not be invalidated on a con¬ 
text switch. Keeping entries valid for future reuse improves 
performance. However, when the kernel decides to swap 
user process pages in or out of disk memory, instructions at
cached physical memory locations may change, invalidating 
the cache. 

The newer Am29030 processor contains an 8-Kbyte instruction
cache. The addition of the cache was made possible
by higher levels of integration that were not available when
the Am29000 was introduced. Research shows that for cache
sizes above 4 Kbytes, a conventional cache provides superior
performance to the BTC memory method.


[Figure 4 diagram: the Am29000 register file. Recoverable
labels: global registers (used to implement global support;
registers such as rab, rfb, and msp); not implemented; gr1,
also known as rsp, points to the top of the cache.]

Registers in the cache are normally referred to as local registers and
are accessed relative to gr1. Gr1 points to local register lr0. Local
register lr1 is located at register address gr1 + 4. All registers are 32
bits wide.

Figure 4. Am29000 register stack.

Because the BTC memory reduces initial instruction access
latency rather than improving memory access bandwidth, the
Am29000 uses a separate bus to support concurrent instruction
fetching. The Am29030 does not require separate buses
for instruction and data memory because the on-chip cache
offers a statistical improvement to the memory bandwidth. It
maintains sufficient instruction memory bandwidth without
the support of a dedicated bus.

Multiprocessor Unix 

Our processor supports on-chip multiprocessor systems 
without the need for large amounts of external hardware. It 
also supports the mechanisms to ensure processor resource 
sharing and signaling. 

An external coprocessor interface extends its execution unit 
with special-purpose instructions. If the coprocessor hard¬ 
ware is not present, then a coprocessor exception trap is 
taken. The processor emulates the coprocessor operation for 
hardware development and code compatibility. 

Multiprocessor cache support. Each TLB entry contains 
2 bits to control external cache operation. Cached pages can 
be marked selectively as not-to-be-cached, local, or shared. 
This enables the operating system to inform the cache of the 
correct action to take, and it results in reduced data traffic on 
the shared-memory bus giving access to the cache. 

A load and lock (LOADL) instruction for device and memory
interlocks activates the LOCK output pin during the address 
cycle of the access. Multiprocessor hardware uses the lock 
signal to delay access of other processors requesting the 
same memory or I/O facility. 

The load and set (LOADSET) instruction implements a binary
semaphore. The operation of writing True to a memory
semaphore location after reading the location into a
general-purpose register cannot be interrupted.
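
The way an uninterruptible read-then-set operation yields a binary semaphore can be sketched in Python. Here a host-language lock stands in for the indivisible memory operation; the class `LoadSetCell`, the helper functions, and the thread counts are all assumptions made for the example and say nothing about how the instruction is implemented in hardware.

```python
import threading
import time


class LoadSetCell:
    """Memory word supporting an indivisible 'read old value, write True'."""

    def __init__(self):
        self._value = False
        self._atomic = threading.Lock()   # models the uninterruptible access

    def loadset(self):
        with self._atomic:
            old = self._value
            self._value = True
            return old

    def clear(self):
        with self._atomic:
            self._value = False


def acquire(cell):
    while cell.loadset():     # True means someone else already holds it
        time.sleep(0)         # yield and try again


def release(cell):
    cell.clear()


def worker(cell, shared, iterations):
    for _ in range(iterations):
        acquire(cell)
        shared["count"] += 1  # critical section
        release(cell)


semaphore = LoadSetCell()
shared = {"count": 0}
threads = [threading.Thread(target=worker, args=(semaphore, shared, 100))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The caller that reads False is the one that took the semaphore; every other caller reads True and must retry.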

Floating-point support 

Unix is a product of the academic and scientific environ¬ 
ment, which means that a number of its user application 
programs require floating-point support. The newer Am29050
processor supports floating-point instructions directly. However,
the Am29000 processor instruction set contains a number
of floating-point instructions that are
not directly supported.

When a floating-point instruction is 
encountered, it traps to a handler that 
sends the appropriate instruction and
data to an optional coprocessor, the 
Am29027, and moves the results back to 
the Am29000. Without the coprocessor, 
the Am29000 calls software routines to 
perform the instruction evaluation. This 
latter method is much slower than using 
a coprocessor. But by using the trap 


method, code developed for the Am29000, which may or 
may not have an Am29027 coprocessor available, will run 
directly on future Am29000 processors (and at much greater 
speed). 

A library of software routines has been developed that 
makes extensive use of the floating-point instructions to per¬ 
form functions. We reserved 15 user-mode-accessible global 
registers for math library use to achieve good performance 
for these functions. We can reduce the number of global 
registers by substituting memory locations, but doing so re¬ 
duces performance. Also, all library users must agree on reg¬ 
ister assignment if the library is to be shared. 

System costs 

The on-chip MMU functions enable the Am29000 to oper¬ 
ate without caches and maintain good speed. Table 3 draws 
on information in the “Am29000 Performance Analysis” document.
The table compares AMD’s 29000 processors running
at 25 MHz against Sun processors and the VAX 11/780. 

The benchmarks for the Am29000 in Table 3 are for systems
with separate instruction and data memory systems. The
Am29000 has separate instruction and data memory buses 
and a shared address bus. The need to implement a memory 
system for instructions and a separate system for data can be 
avoided by connecting the instruction and data buses. 
Doing so makes the Am29000 processor bus system appear 
more like a conventional processor bus system. However, 
sharing a single memory system reduces performance by, 
typically, 15 percent. The use of video memory (video DRAM) 
also avoids the need for two memory systems, as the video-out
port can be connected to the instruction memory bus,
and the random access port to the data bus. 

Table 4 gives the approximate costs of a CPU and memory 
devices. It also shows the cost for a memory system based on 
4 Mbytes of DRAM with an additional 0.5-Mbyte SRAM used 
to implement a cache. Studies show that software-controlled 
caches can efficiently use high-speed memory with a minimum
of hardware complexity and cost. The lightweight
interrupt-handling capability achieved by the Am29000 processor
operating in freeze mode is well suited to implementing a 
soft cache control mechanism. 


Table 3. CPU speeds.

Benchmark        Am29000   Am29000   Am29030   Sun 4    Sun 3   VAX
                 SRAM      DRAM      DRAM
diff             20.35     12.02     20.86     8.50     3.56    1.0
grep             14.81     9.42      18.54     7.00     3.00    1.0
nroff            14.92     10.78     18.31     7.50     2.30    1.0
Dhrystone 2.0    23.00     15.30     19.69     11.71    2.65    1.0

Table 4. CPU and memory costs.

Feature              Am29000    Am29000    Am29000           Am29000      Am29000    Am29030
                     SRAM       DRAM       Soft cache        Video DRAM   DRAM       DRAM
Instruction memory   4 Mbytes   4 Mbytes   4-Mbyte DRAM,     4 Mbytes     4 Mbytes   4 Mbytes
                                           0.5-Mbyte SRAM
Data memory          4 Mbytes   4 Mbytes   Shared            Shared       Shared     Shared
Cost                 $2,792     $424       $352              $448         $264       $262


The Am29030, unlike the Am29000, shares a bus for in¬ 
struction and data memory and thus does not require sepa¬ 
rate memory systems. It achieves higher performance via its 
on-chip, 8-Kbyte instruction memory cache. The Am29050, 
which is currently the only 29K family member that directly 
executes floating-point instructions, is pin compatible with 
the Am29000 processor and thus supports the three-bus 
architecture. 

Because our processor uses only one address bus, the num¬ 
ber of pins and the system costs are reduced. The one- 
address bus technique is successful because of burst-mode 
addressing and the large register cache. The register file greatly
reduces the need to access data in external memory. Burst¬ 
mode addressing enables instructions to be fetched in se¬ 
quence without sending a new address for each access. The 
address bus is only required to establish the address of the 
first instruction in a nonsequential instruction fetch. Video
DRAM or conventional RAM interleaved with a small amount 
of support circuitry, such as an address counter, can be used 
to implement burst-mode memory addressing. 

Contention for the address bus for both instruction and
data access is rare. By the very nature of the instruction stream,
a jump instruction cannot occur on the same cycle as a load
or a store operation; thus the address bus is inherently used
for instruction and data addressing on different cycles.


The Am29000 processor’s reduced instruction
set results in concise, efficient executable code. Each processor
in this family sustains a performance level of 17 VAX
MIPS or higher. In addition, the processors have several de¬ 
sign features that make them particularly suitable for efficient 
Unix implementation, both in terms of operating speed and 
support peripherals required. 

Among these is a scalable methodology that frees the system
implementer from a fixed-price, fixed-architecture approach.
Because the architectural features required to support
Unix are packaged along with the processor, the Am29000
processor is particularly suited to use in low-cost systems.

Hardware Requirements for
Neural Network Pattern Classifiers

A Case Study and Implementation 


A special-purpose chip, optimized for computational needs of neural networks, performs 
over 2,000 multiplications and additions simultaneously. Its data path is suitable particularly 
for the convolutional architectures typical in pattern classification networks but can also be 
configured for fully connected or feedback topologies. A development system permits rapid 
prototyping of new applications and analysis of the impact of the specialized hardware on 
system performance. We demonstrate the power and flexibility of the processor with a neural 
network for handwritten character recognition containing over 133,000 connections. 
Neural networks are rapidly gaining acceptance
as powerful and versatile
tools for pattern classification. However,
the widespread use of neural
network classifiers remains contingent on the
availability of powerful hardware to provide adequate
speed. Such hardware is particularly important
because the computational requirements of
neural network algorithms are quite different from
the high-precision processing for which general-purpose
computers are optimized. Typical neural
network problems involve huge amounts of
low-resolution and possibly redundant data and
require a correspondingly high number of low-precision
arithmetic operations to be performed.

Special-purpose VLSI (very large-scale integration)
processors let us overcome neural network
implementation problems. With their regular structure
and the small number of well-defined arithmetic
operations, these networks are well matched
to integrated circuit technology. The high density
of modern technologies lets us implement a
large number of identical, concurrently operating
processors on one chip, thus exploiting the
inherent parallelism of neural networks. The regularity
of neural networks and the small number
of well-defined arithmetic operations used by
neural algorithms greatly simplify the design and
layout of VLSI circuits.

But processing speed is not the only constraint 
on neural network hardware design; neural net¬ 
work classifiers benefit from a highly structured 
topology with local receptive fields. Of particu¬ 
lar importance are convolutional architectures in 
which neurons with identical weights process 
different parts of the input or internal state. This 
topology builds into the network knowledge 
about locality of data, improving recognition per¬ 
formance. At the same time it lets us multiplex 
neurons with identical sizes and realize the large 
networks required for difficult classification tasks 
within the density limitations of current VLSI technology.
The high speed of VLSI technology (five orders of
magnitude greater than that of natural neurons)
compensates for the loss of computing
speed resulting from partial serial processing in such an
implementation.

Matching the arithmetic precision of the hardware to the
requirements of neural networks is crucial for an efficient
hardware implementation. Neural networks are often cited
for their low-resolution requirement, which lends itself well
to analog implementation. Experiments, however, show that
the precision requirements of neurons within a single net¬ 
work vary. Specifically, research indicates that higher accu¬ 
racy is often needed in the output layer, for example, for 
selective rejection of ambiguous or otherwise unclassifiable 
patterns. This situation can be handled with a hybrid archi¬ 
tecture, which evaluates the bulk of the network with low- 
resolution analog hardware but implements selected 
connections on a digital processor with higher accuracy. 

Based on these considerations, we designed and fabricated 
ANNA (Artificial Neural Network ALU), a special-purpose chip
for neural network pattern classification. The chip features
4,096 individual synapses that can be multiplexed to imple¬ 
ment networks with several hundred thousand connections. 
We can program the number of synapses per neuron to val¬ 
ues between 16 and 256; the number of neurons varies ac¬ 
cordingly between 256 and 16. Implementable network 
topologies include fully connected architectures with or with¬ 
out feedback, local receptive fields, and TDNNs (time-delay 
neural networks) or higher order convolutional connections. 
Depending on the network topology, the chip can sustain a 
performance of up to 5 x 10^9 connections per second (C/s).
Arithmetic operations take place with 6-bit resolution on the 
weights and 3-bit resolution on the states. 

Because of the high speed, parallelism, low resolution, 
and other characteristics of ANNA, which differ considerably 
from those of general-purpose computers, this hardware must 
be readily accessible to the network designer, so new appli¬ 
cations can be prototyped easily. A development system, con¬ 
sisting of a PC or VME board containing an ANNA chip and a 
digital signal processor (DSP), addresses these needs. We 
download the topology and weights of a network into the 
VME board, and the DSP issues control commands to ANNA. 
The DSP also preprocesses and trains the networks and processes
operations that require higher precision than that of
ANNA. 

We selected an optical character recognition neural net¬ 
work to test and demonstrate the flexibility and power of the 
neural network chip. This network identifies handwritten
digits from a 20 x 20-pixel input image and employs neurons
with local receptive fields as well as a fully connected layer.
The network with over 133,000 connections fits on one ANNA
and is evaluated at a rate in excess of 1,000 characters per
second, which constitutes a speedup of two orders of magnitude
over a DSP-based implementation. Despite the low resolution
of the chip, the error rates of the neural network
processor and DSP implementation are very similar at 5.3
percent and 4.9 percent. For comparison, the measured human
performance on the same database is 2.5 percent errors.

Neural network hardware 

Artificial neural networks that solve difficult problems in 
areas such as speech recognition and synthesis, or pattern 
classification, consist of thousands of neurons with tens or 
hundreds of inputs each. Every neuron computes a weighted 
sum of its inputs and applies a nonlinear function to its result.
Architectural parameters, such as the number of inputs
per neuron, and each neuron's connectivity vary considerably
within a network, and from application to application. A
special-purpose neural network processor must be flexible 
and powerful enough to accommodate a wide range of ap¬ 
plications. At the same time, the requirements must be care¬ 
fully balanced and the special nature of the task exploited to 
bring an efficient implementation within reach of today’s 
technology. 

We can distinguish two phases of operation in many neu¬ 
ral network applications. During the learning phase, the to¬ 
pology and weights of the network are determined from a 
labeled set of examples using a rule such as backpropagation
or a network-growing algorithm. In the subsequent retrieval
or classification phase, the network parameters are fixed. 

The network recognizes patterns based on information 
stored in the architecture and weights during training. Since 
the computational and infrastructure requirements (training 
database) during the learning phase are considerably more 
complex than those for classification, efficiency considerations 
call for separate hardware for learning and retrieval. Network 
parameters determined during learning are downloaded into
processors specialized for the classification task. This approach,
which we focus on here, contrasts with implementations of
neural network processors with on-chip learning. Those
circuits are not suitable for the pattern recognition problems 
we investigate here, because of limitations of the training 
algorithms implemented on these chips or because of the 
limited size of the network that can be trained. 

The basic operation performed by a neuron during classification
is a weighted sum, followed by a nonlinear squashing
function $f$, typically a hyperbolic tangent or an approximation
thereof:

$$y = f\Bigl(\sum_i x_i w_i + b\Bigr).$$

We generally refer to the inputs $x_i$ of the neuron as connections
and the $w_i$ parameters as weights. Each input is
either tied to the output y of another neuron or to an external 
input. Optionally, a bias b may be added to the weighted 
sum. 
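
The neuron operation described above can be written in a few lines. This is a minimal sketch that assumes an exact hyperbolic tangent as the squashing function and full floating-point precision; the function name and sample values are ours, chosen only for illustration.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus an optional bias, squashed by tanh."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(weighted_sum + bias)

# Three connections with illustrative values; the result lies in (-1, 1).
y = neuron_output([1.0, -1.0, 0.5], [0.2, 0.4, -0.6], bias=0.1)
```

Each classification costs one multiplication and one addition per connection, which is the quantity the hardware has to deliver in parallel.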

The total number of connections in neural networks for 
applications such as handwritten character recognition may 
amount to 10,000 to several hundred thousand. Networks
that solve more general problems, such as recognition of 
entire words instead of isolated characters, require even larger 
numbers of connections. The speed requirements of typical 
applications call for a few tens to several thousands of classi¬ 
fications per second. For each classification, the network must 
evaluate one multiplication and one addition for every connection,
which translates to a few billion multiply-add operations
per second. Only parallel implementations, in which
several connections are evaluated concurrently, achieve such
computational power.

The most general network topology permits connections
between any two neurons. Such a high degree of (possible)
connectivity, combined with the need for parallel processing,
results in enormous hardware requirements and therefore
calls for a compromise. Usually, the neurons in a network
are arranged in layers, each of which receives inputs only 
from neurons in the previous layer. Layers may be fully con¬ 
nected; that is, each neuron may be connected to every neu¬ 
ron in the preceding layer. Often, however, we use local 
connectivity to express knowledge about the problem (geo¬ 
metric relations such as the neighborhood of pixels in an 
image) in the network architecture and thus improve the recognition
performance.

For example, the fact that some pixels in an image are
adjacent to each other can be built into the network architecture
by constraining neurons to receive inputs only from neighboring
pixels. In a fully connected topology, such information
must be derived from the training set during the learning
phase, usually meeting with only partial success.

A neural network processor could be designed to imple¬ 
ment only networks with fully connected topology. Local 
connectivity would then be realized by simply setting the 
weights of unused connections to zero. Since, in typical neu¬ 
ral networks, the ratio of such unused connections to actual 
connections is easily 100, such an implementation is unac¬ 
ceptably inefficient. The added complexity of the hardware 
required to support local connectivity is no match for the 
millions of connections saved. 

Another challenge for a compact hardware implementa¬ 
tion of a classifier is the amount of memory needed for stor¬ 
ing several tens or hundreds of thousands of weights. 
Fortunately, the weights of many neurons in important connection
topologies, including time-delay or feature extraction
neural networks, are identical. In these architectures
the connection topology corresponds to a one- or higher 
dimensional convolution, followed by the nonlinear squash¬ 
ing function, as is illustrated. We can realize such a structure 
with a single, time-multiplexed neuron with a corresponding 
saving of storage and computing devices. 
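
The saving from weight sharing can be sketched with a one-dimensional example in which a single neuron is reused at every position of the input. The code is an illustration only: the function name, signal, and kernel values are assumptions, and the sliding window is applied without flipping the kernel, as is common in neural network usage of the term convolution.

```python
import math

def shared_weight_layer(signal, kernel, bias=0.0):
    """Apply one neuron (one set of weights) at every window of the signal."""
    width = len(kernel)
    outputs = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        weighted_sum = sum(x * w for x, w in zip(window, kernel))
        outputs.append(math.tanh(weighted_sum + bias))
    return outputs

signal = [0.0, 1.0, 1.0, 0.0, -1.0, -1.0, 0.0, 1.0]
kernel = [-1.0, 0.0, 1.0]          # the only weights that must be stored

outputs = shared_weight_layer(signal, kernel)
connections = len(outputs) * len(kernel)   # multiply-adds that are evaluated
stored_weights = len(kernel)               # weights that occupy memory
```

Here six output neurons account for 18 connections but only three stored weights, and the ratio grows with the size of the input.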

We can further optimize the hardware complexity by match¬ 
ing the computational accuracy of the processor to the re¬ 
quirements of typical neural networks. Both experience and 
theory indicate that neural network classifiers can be designed
to be insensitive to low-resolution arithmetic. Experiments
with character recognizers show that the recognition
performance remains virtually unchanged when the inputs 
and outputs of the neurons are quantized to 3 bits, and the 
weights to approximately 5 bits. Higher resolution is required 
in the last layer for the rejection of ambiguous or unclassifiable
patterns. Since in typical neural networks the output layer
contains only a small fraction of the total number of connections,
we reduce system complexity by evaluating those connections
on a different processor with higher accuracy.
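
The effect of low-resolution arithmetic on a single dot product can be explored with a simple quantizer. The sketch below assumes a uniform, symmetric quantizer over the range -1 to +1 with 3-bit states and 6-bit weights; this is an illustrative model, not a description of the coding actually used on the chip, and all names and values are hypothetical.

```python
def quantize(value, bits, full_scale=1.0):
    """Round value to the nearest of 2**bits - 1 uniform levels in [-full_scale, +full_scale]."""
    half_levels = (2 ** bits - 2) // 2       # number of levels on each side of zero
    step = full_scale / half_levels
    clipped = max(-full_scale, min(full_scale, value))
    return round(clipped / step) * step

def dot(xs, ws):
    """Plain dot product of two equally long sequences."""
    return sum(x * w for x, w in zip(xs, ws))

states = [0.9, -0.35, 0.1, 0.6]
weights = [0.25, -0.8, 0.5, -0.05]

exact = dot(states, weights)
coarse = dot([quantize(x, 3) for x in states],
             [quantize(w, 6) for w in weights])
error = abs(exact - coarse)
```

Comparing `exact` and `coarse` over a representative set of inputs gives a first impression of whether a layer tolerates the reduced resolution or should be evaluated with higher accuracy.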

ANNA 

Figure 1 shows the building blocks of ANNA, a neural net¬ 
work chip that implements the concepts just outlined. It con¬ 
currently evaluates several dot products of state and weight 
vectors and applies a nonlinear squashing function to the re¬ 
sults. Data enters the chip through a shift register, which reads 
up to four values at a time. A file with 16 vector registers stores 
intermediate results when multilayer networks are evaluated.

The 64-word-wide (3 bits per word) shifter reads up to four 
inputs in each cycle. In this process, the current shifter contents
shift left one to four word positions. The use of a shifter
limits the number of pins required and supports convolutional
network topologies and multiplexing of neurons with identical
weights. The shifter alone handles one-dimensional convolutions,
while an external data formatter (for example, a line
delay register) is needed for two- or
higher dimensional computations. Data
loaded into the chip can be buffered tem¬ 
porarily in a file of 16 vector registers to 
reduce the required input bandwidth, to 
evaluate neurons with more than 64 in¬ 
puts, or to store intermediate results. 

Eight banks of vector multipliers per¬ 
form the actual computation. Each bank 
consists of a latch to hold the state vector 
plus eight vector ALUs with 64 synapses 
each. A multiplexer that can be config¬ 
ured to combine the contributions from 
one to four vector multipliers connects the 
outputs from the vector multipliers to the 
neuron bodies. When the latches of several
vector multiplier banks hold different
data, the network evaluates neurons with 
up to 256 inputs. The number of neurons 
depends on the number of inputs: Ex¬ 
tremes of 16 neurons with 256 inputs 
each, or 256 neurons with 16 inputs, as 
well as many intermediate arrangements
are possible. The topology can be rearranged
on a per-instruction basis to permit
evaluation of several layers of a
network with different architectures on a 
single chip without performance penalty. 
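
The trade-off between inputs per neuron and number of neurons follows from the fixed pool of 4,096 synapses. The enumeration below is a sketch that assumes the intermediate arrangements step in powers of two; the text states only the two extremes, so the intermediate entries are an assumption.

```python
TOTAL_SYNAPSES = 4096

def neuron_configurations(min_inputs=16, max_inputs=256):
    """List (inputs per neuron, number of neurons) pairs that use all synapses.

    Assumes the number of inputs doubles from one arrangement to the next.
    """
    configurations = []
    inputs = min_inputs
    while inputs <= max_inputs:
        configurations.append((inputs, TOTAL_SYNAPSES // inputs))
        inputs *= 2
    return configurations

configurations = neuron_configurations()
for inputs, neurons in configurations:
    print(f"{inputs:3d} inputs per neuron -> {neurons:3d} neurons")
```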

The neuron bodies first scale the out¬ 
put from the vector multipliers by a factor 
that can be set in the range 1/16 to 1/2 in 
eight levels to optimize the useful dynamic
range of the circuit. Then the neuron
bodies evaluate the squashing
function and convert the result to the 
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digital/analog converters (DACs) update 
the values of two different synapses in 
each clock cycle for a refresh speed of 
110 µs for the entire array.

The chip is programmed with three instructions,
CALC, SHIFT, and STORE, to
perform computations, load data from an
external data source, and transfer data
between the shifter, register file, and vector multiplier banks.
Parameters for each instruction determine shift count, source
of data, number of inputs per neuron, or neuron gain. The
CALC instruction executes in four 50-ns cycles, and the network
evaluates the other two operations in one clock cycle
concurrently with an ongoing CALC instruction. In 200 ns the
chip can, for example, load eight states and store them in a
register and two latches, and evaluate the dot product and
nonlinear function of eight vectors with 256 components each.
The weight refresh takes place simultaneously, transparent to
the user. Table 1 summarizes the features of the chip.

Figure 1. Block diagram of the neural network chip, ANNA.

Table 1. System features.

Characteristic                  Value
Synapses                        4,096
Bias units                      256
Synapses per neuron             16 to 256
Weight accuracy                 6 bits
State accuracy                  3 bits
Input rate                      120 Mbps
Output rate                     120 Mbps
On-chip data buffers            4.6 Kbits
Computation rate (sustained)    5 GC/s
Refresh (all weights)           110 µs
Clock rate                      20 MHz

Figure 2. Die photograph. The synapse array can be seen
in the center, the shifter and register file on the left, the
neuron bodies at the top, and the weight refresh DACs on
the right.

The chip contains 180,000 transistors and measures 4.5 x 7
mm^2 (see Figure 2). It was fabricated in single-polysilicon,
double-metal, 0.9-µm CMOS technology with a 5-V power supply.
The current drawn by the chip reaches 250 mA when all
weights are programmed to their maximum value but is less
than 100 mA in typical operation.

Programmability is one of the key features of the neural 
network chip. Table 2 lists a selection of network topologies 
that can be implemented and the achieved performance in
each case. The chip processes networks with full or sparse
connection patterns of selectable size, as well as networks
with feedback at a sustained rate of over 10^9 connections per
second.


Table 2. Sample network architectures
and performance.

Network topology                 Average performance (GC/s)
Fully connected (one layer)
  64 inputs, 64 outputs          2.1
  128 inputs, 32 outputs         1.2
  32 inputs, 128 outputs         1.2
Local receptive fields
  64 x 1, 64 features            2.3
  16 x 16, 16 features           4.7
  16 x 8, 32 features            3.6
Multilayer network
  64 inputs, 32 hidden,
  32 hidden, 32 outputs          0.8
Hopfield neural network
  64 neurons                     2.1


Of particular importance for neural network pattern classifi¬ 
ers are neurons with local receptive fields and weight sharing, 
such as TDNNs. The neural network chip supports weight
sharing in several ways. The shifter and register file enable
loading of data and the computation to go on in parallel. Also,
data that has been loaded onto the chip once can be buffered
and reused in a later computation. Finally, rather than requir¬ 
ing separate hardware for all weights, neurons with identical 
parameters are stored only once. 

Development system 

As mentioned before, the characteristics of the ANNA chip
(high speed, parallel computation, limited instruction set, and
low resolution) differ considerably from those of general-purpose
computers. Efficient algorithms that derive optimal benefit
from the special processor can be designed only if the
processor is available in the early design stages. A development
system consisting of an ANNA, a workstation, and appropriate
software addresses this requirement. Figures 3 and 4
illustrate the hardware setup, which includes an ANNA and a
20-Mflops DSP32C with 1-Mbyte fast static RAM.

A DMA interface that directly maps the SRAM into the ad¬ 
dress space of the PC bus or VMEbus exchanges data with the 
host computer. The DSP, which is also used for pre- and 
postprocessing and for computations that require higher precision
than that of ANNA, generates instructions for ANNA.
The entire system is controlled by a program running on the
workstation that calls routines and exchanges data with the
DSP transparently to the user. The software for the system is
written in the high-level language C++, with the exception of
a few time-critical routines that are handcoded in DSP assembly
language.

Figure 3. Block diagram of the neural network accelerator
board.

Networks are trained on the workstation or the DSP. The
neural network chip can also be included in the training process,
for example, to adjust the network to ANNA's low-resolution
processing. Training of individual chips is not necessary,
however, because of the good matching between individual 
devices. Once trained, the network topology and weight val¬ 
ues are downloaded into the DSP for execution on ANNA. 

Character recognition 

Speed, capacity, and programmability are important aspects 
of neural network hardware. Their practical relevance, however,
must be proven on a real-world application, such as the
implementation of an optical digit recognizer on the neural
network chip we describe here. This network has been trained
with the backpropagation algorithm to recognize handwritten
digits from a 20 x 20-pixel image. The classification error rate
on a test set consisting of 2,000 handwritten digits is 4.9 percent
misclassifications, compared to a human performance
of 2.5 percent on the same data.

Figure 5 illustrates the architecture of the network; Table 3
lists statistical information about each layer. The more than
3,500 neurons with a total of over 133,000 connections are
arranged in five layers. The first four layers employ a 2D convolutional
topology with various kernel sizes and subsampling
factors. Because of weight sharing, the number of weights
(free parameters) in these layers is much smaller than the number
of connections. The last layer is fully connected. We chose
this topology to maximize recognition performance and classification
speed of an implementation on a floating-point
DSP32C digital signal processor.

Special steps are necessary to adapt the network to the low
resolution of the chip. Simple quantization of all weight values
results in an unacceptable loss of accuracy. However, experiments
reveal that the computational accuracy provided by the
chip is adequate for all but the 3,000 weights in the last layer
of the network. This last layer is retrained with quantized data
obtained from the chip to eliminate performance degradation.
After retraining, the classification error rate on the test set is 5.3
percent, compared to the original 4.9 percent. This result is
obtained consistently with different chips for which the last
layer has not been retrained individually. Figure 6 shows the
input, output, and internal states of the neural network for a
sample input that has been processed by the neural network
chip.

Figure 4. Neural network accelerator with ANNA and the
DSP32C.

Figure 5. Architecture of the character recognition
network. (Legend: neuron; receptive field of neuron. The
network has 10 outputs.)

Table 3. Connectivity of character
recognizer neural network.

Layer    Neurons    Connections    Weights
5        10         3,000          3,000
4        300        1,200          12
3        1,200      50,000         500
2        784        3,136          4
1        3,136      78,400         100

Figure 6. Sample chip output for optical character recognition. The gray levels encode the neuron state.
(Figure labels: input, 20 x 20 pixels; 130,000 connections
evaluated by the neural network chip; 3,000 connections
evaluated by the DSP.)

The first four layers of the network with 97 percent of the 
connections but only 616 weights fit on a single neural network
chip. The remaining 3,000 connections of the last layer
are evaluated on the DSP32C. The throughput of the chip is
more than 1,000 characters per second or 130 million connections
per second. This figure is considerably lower than the peak
performance of the chip (5G connections per second), a consequence
of the small number of inputs of most neurons in
the network for which the chip cannot fully exploit its parallel¬ 
ism. Nevertheless, the chip’s performance compares favorably 
to the 20 characters per second that are achieved when the 


entire network is evaluated on the DSP32C. The recognition 
rate of the chip is far higher than the throughput of the prepro¬ 
cessor, which relies on conventional hardware. Improvements 
of both the recognition rate and accuracy can be expected 
when the network architecture is tuned to take full advantage 
of the parallelism of the ANNA chip. 
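
The throughput figures quoted above can be cross-checked with a short calculation. This is a sketch that uses only the rounded numbers given in the text (about 130,000 connections evaluated on the chip per character, 1,000 characters per second on the chip, 20 characters per second on the DSP32C, and a peak of 5 x 10^9 connections per second); the variable names are ours.

```python
CONNECTIONS_ON_CHIP = 130_000      # connections evaluated by the chip per character
CHIP_CHARS_PER_S = 1_000           # lower bound quoted for the chip
DSP_CHARS_PER_S = 20               # whole network evaluated on the DSP32C
PEAK_CONNECTIONS_PER_S = 5e9       # peak rate of the chip

sustained = CONNECTIONS_ON_CHIP * CHIP_CHARS_PER_S   # connections per second
utilization = sustained / PEAK_CONNECTIONS_PER_S     # fraction of the peak rate
speedup = CHIP_CHARS_PER_S / DSP_CHARS_PER_S         # chip versus DSP

print(f"sustained rate: {sustained / 1e6:.0f} million connections/s")
print(f"fraction of peak: {utilization:.1%}")
print(f"speedup over DSP: {speedup:.0f}x")
```

The small fraction of peak performance reflects the point made in the text: most neurons in this network have too few inputs to keep all synapses busy.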


Neural networks are attractive for pattern classification
applications but suffer in practice from the limited
speed that can be achieved with implementations based on
classical processors. This problem can be overcome with
highly parallel special-purpose VLSI circuits. While a fully parallel
implementation of sufficiently large networks is currently
not feasible, we can achieve adequately high performance with
an architecture that exploits the limited connectivity and weight
sharing that are typical for pattern classifiers. We demonstrated
this performance with a neural network classifier with over
133,000 connections that has been implemented on a single
neural network chip performing over 1,000 classifications per
second. This result eliminates throughput from the constraints
faced by network designers. The availability of fast special-purpose
hardware for large applications sets the conditions to
explore new neural network algorithms and problems of a
scale that would not be feasible with conventional processors.

We expect further advances when the architecture of the
network is modified to fully take advantage of the chip's parallelism.
While the size of the current network has been constrained
by the speed of conventional hardware, such issues
vanish because of the high speed of the chip. The price for
this throughput is the specialization of the circuit, specifically
its low resolution, and its focus on neural network algorithms.
Future research will benefit from the speed of the novel hardware
but must also address questions regarding the limitations
of special-purpose hardware. Furthermore, it appears attractive
to implement larger tasks (to include image location, segmentation,
and scaling into the recognition process) with
neural networks to benefit from the powerful hardware.
Experimentation with
Hypercube Database Engines 


Effective algorithms appropriately selected for each task are essential if we are to take full 
advantage of parallelism in database systems. Some algorithms perform faster than others, 
depending on the uniqueness value of the data and the uniformity of distribution among the 
nodes in the system. A performance evaluation under varying experimental parameters de¬ 
termines the optimality of a given algorithm for a particular database task. 
Arun K. Sood

George Mason University 


Interest in multiprocessors for database
processing has grown beyond the research
communities, into the development
sectors. As the volume and
variety of data stored, accessed, and manipulated
grows, computer and communications designers
are focusing on parallelism, in particular using
general-purpose commercial multicomputers, as
a way to meet database processing needs.

Using one such computer, Intel’s iPSC/2 
hypercube, we measured the relationship be¬ 
tween packet size, method of clustering messages, 
and internode traffic on the total sustained com¬ 
munication bandwidth. Having measured the 
costs associated with internode communication,
we then analyzed duplicate removal algorithms. 
Duplicate removal is an integral part of several 
frequently used relational database operations, 
including PROJECT and UNION. We believe it 
is, therefore, important to develop efficient 
algorithms. 

We also studied the effects of nonuniformly
distributed attribute values and tuples across processors
on three proposed duplicate removal algorithms.
We chose algorithms to represent the
several available in the literature. We then evaluated
the output collection time.

A tutorial on multiprocessor/multicomputing
databases or current generation multicomputing
architectures is beyond the scope of this article.
We intend only to present a brief overview of the
iPSC/2's hypercube message-passing system, and
discuss the results of our experimentation and
analysis. Although our algorithms were implemented
on the iPSC/2, we believe they would
work as well on other hypercube machines, such
as Ncube and Floating Point Systems' T series.

iPSC/2

Intel's Personal Super Computer, the iPSC/2, is
a multiple-instruction, multiple-data (MIMD), 
multicomputer with nodes connected via a 
hypercube topology. Although many definitions 
for a multicomputer system exist, we define it as 
a system comprised of multiple nodes, each of 
which is a complete computer. That is, each node 
has a set of resources (dedicated I/O, memory, 
and CPU) and is under the independent control 
of its own operating system.

Message-passing application interface. Application programs
access the iPSC/2's message-passing system through
system calls. The NX/2 operating system provides three
levels of system calls: synchronous, asynchronous, and interrupt-driven.
Users can mix and match these three types of
calls. For example, a message sent with an asynchronous call
can be received either with a synchronous call or an interrupt-driven
call. The size of the message is limited only by
the amount of physical memory available at the node.

The processor uses an application-development message¬ 
passing paradigm. Consider, for example, a node program 
that receives data from the host or from another node and 
then locally processes them. If the node cannot locally pro¬ 
cess without the received data, then it must use a synchro¬ 
nous receive call to receive the data. 

On the other hand, if the node can accomplish some of 
the local processing without the received data, then it uses 
an asynchronous receive. After initiating the asynchronous 
receive, the node carries out the local processing that does 
not require the received data. The node program then checks
if the data have been received. If the data have not been 
received, the program waits. 
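
The asynchronous receive pattern described above can be sketched in ordinary Python. The code does not use the NX/2 system calls; it imitates the pattern with a worker thread, and every function name and the simulated delay are hypothetical stand-ins chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def receive_message():
    """Stand-in for a message arriving from another node (hypothetical)."""
    time.sleep(0.05)                 # simulated transit time
    return [3, 1, 2]

def independent_local_work():
    """Local processing that does not need the incoming data."""
    return sum(range(1000))

def node_program():
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(receive_message)   # initiate the asynchronous receive
        partial = independent_local_work()       # overlap computation with transit
        data = pending.result()                  # wait only if the data has not arrived
    return partial, sorted(data)                 # processing that needs the data

result = node_program()
```

A synchronous receive corresponds to calling `receive_message()` directly before any local work, which leaves the node idle for the whole transit time.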

The synchronous calls are also called blocking calls. When 
a node sends a message using this protocol, it is blocked for 
processing until the operating system copies the send buffer 
into its local buffer and is ready to send this message to its 
destination. This operation does not imply however that the 
message sent is received at the destination node. The mes¬ 
sage buffering and flow control in NX/2 ensure the synchro¬ 
nous transmission intended by the application. 

The asynchronous calls are nonblocking calls. When a node
sends a message using this protocol, the sending node is free
to process soon after the execution of that call. The receiving
node must poll for any messages received asynchronously.
This protocol lets the application perform other tasks
while the messages are in transit.

Message-passing measurements. We conducted the fol¬ 
lowing experiments to analyze the communication behavior 
of iPSC/2's message-passing system. We measured the communication
times between two adjacent nodes and between two nodes
at the extreme diagonals of the hypercube. The communication
times for one-hop (adjacent nodes) and multihop
(nonadjacent nodes) are almost equal, because the interme¬ 
diate nodes take only a few microseconds to help establish a 
path. Thus, their computation phase is uninterrupted. This 
indicates that the iPSC/2 message-passing interconnection, 
though organized as a hypercube topology, practically func¬ 
tions as a fully connected interconnection network. 
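
The distinction between one-hop and multihop paths can be made concrete with a short sketch. It assumes the usual binary numbering of hypercube nodes, in which neighbors differ in exactly one bit, so the minimum hop count is the Hamming distance between node numbers; the function names are ours and are not part of the NX/2 interface.

```python
def hop_count(source, destination):
    """Minimum number of hops between two hypercube nodes (Hamming distance)."""
    return bin(source ^ destination).count("1")

def neighbors(node, dimension):
    """Nodes directly connected to the given node in a hypercube of this dimension."""
    return [node ^ (1 << bit) for bit in range(dimension)]

# Eight-node system: a three-dimensional hypercube with nodes 0 to 7.
adjacent_hops = hop_count(0, 1)      # neighboring nodes
diagonal_hops = hop_count(0, 7)      # nodes at opposite corners
node_zero_neighbors = neighbors(0, 3)
```

In an eight-node system the extreme diagonal is three hops long, which is the multihop case compared against the one-hop case in the measurements.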

A fully connected 128-node system has eight simultaneously
available paths, that is, paths without contention. However,
if multiple nodes simultaneously send messages to a common
destination, then contention is inevitable. To evaluate
the effects of contention we measured the communication
times when one or more source nodes send messages to a
single receiving node. The communication times at the source
nodes increase in proportion to the number of source
nodes.

Figure 1. Communication time as a function of message
size (Kbytes): small (a), large (b).

In our experiments, we used an eight-node system and 
measured the average communication time of 1,000 messages in 
which each node sends to and receives from another node. 
Figure 1 shows the communication time for varying message 
sizes between neighbor nodes and nodes three hops apart. 
Figure 2 illustrates the computed bandwidths of the 
communication channel for different operating conditions. 
Note that the communication time for longer messages is 
comparable to that of smaller messages. Consequently, we 
prefer implementations requiring fewer but longer messages. 

Figure 2. Communication bandwidth as a function of message 
size: small (a), large (b). 

We measured the communication time for a zero-byte message 
at 361 μs. This is the minimum setup time required to 
transmit a message of any size. In addition to the setup 
time, messages larger than zero bytes need a propagation time. 

Figure 3. Message transmission protocols: one-step (a), 
three-step (b). In (a), the source node establishes a path 
and transmits data to the destination node. In (b), three 
numbered transfers pass between the source and destination 
nodes: path establishment (1), status acknowledgement from 
the destination (2), and data transmission (3). 

The system uses two protocols for message transmission. 
For short messages (zero to 100 bytes), a one-step protocol, 
shown in Figure 3a, sets up a path and transmits data 
concurrently. 

Figure 1a shows that, for messages larger than 100 bytes, 
the communication time increases by almost 280 μs. (Since 
all messages include a 4-byte boundary, a 100-byte message 
actually takes 104 bytes.) The time increases because the 
system uses a three-step protocol, shown in Figure 3b, for 
larger messages. In the first step, the source node 
establishes a path by sending a status message to the 
destination node. In the second, if the destination node is 
ready to receive a large message, it sends back a status 
acknowledgment. In the final step, the source node sends the 
complete message to the destination node. 

This three-step protocol increases the communication time 
for large messages, but two different protocols are required 
since the one-step protocol could overflow the memory buffer 
if used for large messages. Conversely, the three-step method 
would add unnecessary protocol overhead if used for small 
messages. 

Figure 1b shows the communication time chart for messages 
between 256 and 4,096 bytes. Note that communication time 
for extreme diagonal nodes and adjacent nodes is almost 
equal. Figure 2a shows the bandwidth chart for messages from 
0 to 256 bytes. Figure 2b shows the bandwidth chart for 
messages in the range from 256 bytes through 64 Kbytes. To 
achieve a 2.7-Mbyte/s bandwidth, the message size must be at 
least 64 Kbytes. For small messages, say 100 bytes, the 
bandwidth is 0.21 Mbytes/s. 
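
The relation between setup time, protocol choice, and bandwidth can be explored with a rough cost model. The sketch below is an assumption-laden illustration, not a fit to the measured curves: the 361-μs setup time, the 280-μs three-step penalty, and the 100-byte limit come from the text, while the linear form of the model and the per-byte cost PER_BYTE_US are assumed values chosen only for illustration.

```python
SETUP_US = 361.0             # zero-byte message time (from the text)
THREE_STEP_EXTRA_US = 280.0  # extra time above 100 bytes (from the text)
SHORT_LIMIT_BYTES = 100      # one-step protocol limit (from the text)
PER_BYTE_US = 0.36           # ASSUMED per-byte cost, not a measured value


def transfer_time_us(size_bytes):
    """Estimated one-way time for a message, in microseconds."""
    t = SETUP_US + PER_BYTE_US * size_bytes
    if size_bytes > SHORT_LIMIT_BYTES:
        t += THREE_STEP_EXTRA_US     # three-step protocol overhead
    return t


def bandwidth_mbytes_per_s(size_bytes):
    """Effective bandwidth; bytes per microsecond equals Mbytes/s (10^6 bytes)."""
    return size_bytes / transfer_time_us(size_bytes)
```

Under these assumptions the fixed costs dominate short messages and are amortized only for messages of tens of Kbytes, which is the qualitative behavior reported for Figure 2.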
We also measured communication times for message 
transmissions with contention in the network. Contention is 
created by transmitting messages from one or more source 
nodes to a common destination node. To demonstrate, we chose 
an eight-node hypercube with an arbitrarily chosen 
destination node 5. Nodes 0 through 4, 6, and 7 transmit 
messages to node 5, and the transmission times are recorded. 

Figure 4 shows the number of common destinations versus 
message communication time. Each curve represents a 
different message size. The message communication time 
increases almost linearly as the number of common 
destinations increases. For example, sending a 100-byte 
message from a single source node takes 501 μs, while for 
seven source nodes, the communication time is 3,006 μs. For 
a 100-byte message, the communication time increases by 
approximately 400 μs for each common destination node added. 
Similarly, for a 104-byte message, the communication time 
increases by approximately 800 μs for each common destination 
node. (The dramatic increase for a 104-byte message is due 
to the three-step protocol.) 

Figure 4. Number of common destinations versus time. 

We also observed that the communication time increase 
caused by contention is independent of the location of the 
common destination node. This increase stems from the 
communications technology employed in the iPSC/2. These 
measurements suggest that, to achieve scalable performance, 
we should use algorithms that avoid network contention. 

Our experiments led us to the following conclusions: 

• Designers should choose an appropriate message size 
based on the requirements of an application. The three-step 
protocol causes extra overhead in the transmission of 
messages. Thus, if the message size is only slightly over 
100 bytes, it may be advantageous to reduce the message size. 

• For short message transmissions, the iPSC/2 offers a quite 
low bandwidth, on the order of Kbytes per second. Thus, 
message clustering may be required to achieve higher 
performance. In message clustering, one or more messages 
destined for the same target node are collected in a packet 
and transmitted as a single message packet. However, message 
clustering increases processing overhead. For some 
applications, this increased overhead may nullify the 
benefits of the higher bandwidth. 

• The iPSC/2 effectively supports a fully connected network. 
With the notable exception of minimizing node contention, 
there are few additional benefits to partitioning the 
application to suit the adjacent node communications. 

• Simultaneous data transfer to the same node dramatically 
increases the communication time. Consequently, we prefer 
algorithms that avoid such contention. 

We designed the algorithms presented in the following sec¬ 
tions with these communication constraints in mind. 
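
The message clustering mentioned in the conclusions above can be sketched as a simple grouping step. This is a minimal illustration under stated assumptions: messages are modeled as (destination, payload) pairs, and cluster_messages is a hypothetical helper, not part of any iPSC/2 library.

```python
from collections import defaultdict


def cluster_messages(messages):
    """Group (destination, payload) pairs into one packet per destination."""
    packets = defaultdict(list)
    for destination, payload in messages:
        packets[destination].append(payload)   # payload order is preserved
    return dict(packets)
```

Each resulting packet would then be sent as a single message, trading the extra grouping work for fewer setup costs.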

Relational database 

Numerous commercial relational database systems are 
available that execute on a range of host processing systems 
including personal, mini-, and mainframe computers, as well 
as a host of multicomputer environments. Each system’s 
performance depends on the underlying execution environment 
and the liberties taken by the software developer. For 
example, duplicate tuples (or rows) are not allowed in the 
actual relational database model, and yet several vendors 
support duplicate tuples. We initially describe a subset of 
relational operators and then provide a discussion of three 
duplicate removal algorithms that can be employed as part 
of the implementation of these operators. 

Operators. The relational database model is the underlying 
environment for a wide diversity of applications. For 
example, medical pictorial databases, commonly called picture 
archiving and communication systems (PACS), employ the 
relational model to store and access patient data. Some 
designers propose protocol verification systems that use the 
relational model to implement an efficient algorithm based 
on a reachability analysis technique. Thus, the development 
of high-performance parallel algorithms for the relational 
operators impacts not only the traditional consumer database 
processing arena but also nontraditional database domains, 
including the communications and medical communities. 
Table 1. Relations r and s. 

Relation r                 Relation s 
Employee number   Age      Employee number   Children   Gender 
0                 33       0                 3          M 
1                 25       1                 0          M 
2                 53       5                 3          F 
3                 25       7                 2          F 
4                 25       8                 1          F 
5                 45       10                4          M 
6                 28       16                8          F 
7                 45       17                4          F 
8                 64       18                2          M 
                           22                           F 
                           28                0          M 
                           31                0          F 
                           42                3          F 
                           43                2          M 

To review relational database nomenclature, consider the 
relational database shown in Table 1. This database consists 
of two relations, r and s. Relation r has two attributes: 
employee number and age. Relation s has three attributes: 
employee number, number of children, and gender. 

Each attribute is defined over some finite or countably 
infinite domain. In this example, the attributes in r are 
defined over the set of whole numbers and the set of whole 
numbers between 0 and 120. The attributes in s are defined 
over the domains of the set of whole numbers, the set of 
whole numbers between 0 and 30, and the set {M, F}. Relation 
r consists of nine tuples; each tuple contains two attributes: 

{<0, 33>, <1, 25>, <2, 53>, <3, 25>, <4, 25>, <5, 45>, 
<6, 28>, <7, 45>, <8, 64>}. 

Similarly, s consists of 14 tuples, each comprising three 
attributes. 

Several popular operators exist for the manipulation of the 
relations comprising the database. Here, we briefly overview 
only four: Project, Union, Select, and Join. Formally 
specified, these operators are defined as follows. 

• Project. The projection on $P[XYZ]$, denoted as $\pi_A(P)$, 
is defined by $\pi_A(P) = \{\, x[A] \mid x \in P \,\}$, where 
$A$ is a set of attributes of $P$. 

• Union. The union of two relations, $P[XYZ]$ and $Q[XYZ]$, 
denoted as $P[XYZ] \cup Q[XYZ]$, is defined by 
$P[XYZ] \cup Q[XYZ] = \{\, x \mid x \in P \text{ or } x \in Q \,\}$, 
where $X$, $Y$, and $Z$ are a disjoint set of attributes. 

• Select. The selection on $P[XYZ]$, denoted as 
$\sigma_{B \Theta b}(P)$, where 
$\Theta \in \{<, \le, =, \ge, >, \ne\}$, is defined by 
$\sigma_{B \Theta b}(P) = \{\, x \mid x[B] \mathbin{\Theta} b,\ x \in P \,\}$, 
where $B$ is an attribute of $P$. 

• Join. The join of two relations $P[XYZ]$ and $Q[VWX]$, 
denoted as $P[XYZ] \bowtie Q[VWX]$, is defined by 
$P[XYZ] \bowtie Q[VWX] = \{\, u \mid p \in P,\ q \in Q,\ p[X] = q[X],\ u[XYZ] = p,\ u[VWX] = q \,\}$, 
where $V$, $W$, $X$, $Y$, and $Z$ are a disjoint set of 
attributes. If no common joining attributes exist, the join 
of $P$ and $Q$ is the Cartesian product of $P$ and $Q$. 

Given relations r and s as shown in Table 1, Figure 5 
evaluates three sample user requests, or queries. 

In typical database applications, most attributes contain 
duplicate values. Duplicates occur frequently in database 
processing due to the inherent nature of the data being 
processed. For example, George Mason University has about 
20,000 students in 43 undergraduate programs. Therefore, the 
program discipline attribute in the student database contains 
a high degree of duplicates, that is, a low uniqueness factor. 


1. Find all employees with three children. 

$\sigma_{\text{Children}=3}(s)$ = 

Employee number   Children   Gender 
0                 3          M 
5                 3          F 
42                3          F 

2. Find all employees that are childless or are 25 years old. 

$\pi_{\text{Emp No.}}(\sigma_{\text{Children}=0}(s)) \cup \pi_{\text{Emp No.}}(\sigma_{\text{Age}=25}(r))$ = 

Employee number 
1 
3 
4 
28 
31 

3. Determine all the available information about 45-year-old 
women. 

$\sigma_{\text{Age}=45}(r) \bowtie \sigma_{\text{Gender}=F}(s)$ = 

Employee number   Age   Children   Gender 
5                 45    3          F 
7                 45    2          F 

Figure 5. Queries. 
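
The four operators and the three queries of Figure 5 can be reproduced with a small set-based sketch. It is an illustration only, under stated assumptions: the Relation type, the attribute names emp, age, children, and gender, and the function names are hypothetical, and the children value that is missing for employee 22 in Table 1 is stored as None.

```python
import operator
from typing import NamedTuple


class Relation(NamedTuple):
    attrs: tuple       # attribute names, in column order
    rows: frozenset    # a set of tuples, so duplicates cannot occur


def project(rel, attrs):
    idx = [rel.attrs.index(a) for a in attrs]
    return Relation(tuple(attrs),
                    frozenset(tuple(row[i] for i in idx) for row in rel.rows))


def union(p, q):
    if p.attrs != q.attrs:
        raise ValueError("union needs identical attribute lists")
    return Relation(p.attrs, p.rows | q.rows)


def select(rel, attr, compare, value):
    i = rel.attrs.index(attr)
    return Relation(rel.attrs, frozenset(
        row for row in rel.rows
        if row[i] is not None and compare(row[i], value)))


def join(p, q):
    common = [a for a in p.attrs if a in q.attrs]
    extra = [a for a in q.attrs if a not in common]
    rows = frozenset(
        pr + tuple(qr[q.attrs.index(a)] for a in extra)
        for pr in p.rows for qr in q.rows
        if all(pr[p.attrs.index(a)] == qr[q.attrs.index(a)] for a in common))
    return Relation(p.attrs + tuple(extra), rows)   # Cartesian product if no common attrs


r = Relation(("emp", "age"), frozenset({
    (0, 33), (1, 25), (2, 53), (3, 25), (4, 25),
    (5, 45), (6, 28), (7, 45), (8, 64)}))
s = Relation(("emp", "children", "gender"), frozenset({
    (0, 3, "M"), (1, 0, "M"), (5, 3, "F"), (7, 2, "F"), (8, 1, "F"),
    (10, 4, "M"), (16, 8, "F"), (17, 4, "F"), (18, 2, "M"),
    (22, None, "F"),   # children value not given in Table 1
    (28, 0, "M"), (31, 0, "F"), (42, 3, "F"), (43, 2, "M")}))

query1 = select(s, "children", operator.eq, 3)
query2 = union(project(select(s, "children", operator.eq, 0), ["emp"]),
               project(select(r, "age", operator.eq, 25), ["emp"]))
query3 = join(select(r, "age", operator.eq, 45),
              select(s, "gender", operator.eq, "F"))
```

Because the rows are held in a set, Project and Union remove duplicates implicitly here; the parallel algorithms discussed later do that removal explicitly across nodes.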
If an attribute value in the local database is also present in the received packet, then 
{ 
    if (local_count > packet_count) then 
        local_global_count = local_global_count + packet_count 
    else if (local_count < packet_count) then 
        delete the tuple from the local database 
    else if (local_count = packet_count) then 
    { 
        if (local_node_address > packet_node_address) then 
            local_global_count = local_global_count + packet_count 
        else 
            delete the tuple from the local database 
    } 
} 

Figure 6. Multiple-message ring algorithm (local processing). 

In the database presented in Table 1, duplicate entries 
appear in the number of children and employee gender 
attributes of s, and in the age of the employees attribute in 
r. In fact, in most cases, only the key attributes are 
unique. We focused on designing algorithms that remove 
duplicates. 

We also studied the effects of low uniqueness on the 
performance of the proposed parallel algorithms. Other 
authors have written on the computational complexity and 
processing requirements of relational database queries in 
the presence of duplicate attribute values. 

Of the relational database operators presented earlier, we 
emphasize Project and Union since both can result in 
duplicate tuple values in the initial stage of processing. 
That is, the Project operator initially eliminates columns, 
which may result in duplicate tuple values. Thus the removal 
of duplicates from the generated set of tuples must follow. 

The Union operator eliminates the duplicate copies that 
result from the initial processing stage. In the Union 
operator, the initial processing stage involves the combining 
of the tuples of the first relation with the tuples of the 
second relation. Duplicate removal comprises the second 
stage. (In a distributed-memory architecture, the first stage 
can occur without any internode communication, whereas, in 
the second stage, all nodes must communicate.) The duplicate 
tuples belong to the intersection of the input relations to 
the Union operation. 

Algorithms for duplicate removal. We describe three 
parallel duplicate removal algorithms: the multiple-message 
ring (or, simply, the ring), recursive reduction, and bucket 
algorithms. Each algorithm removes the duplicates in two 
steps: locally at each node, and then globally throughout 
the hypercube. 

All three algorithms commence with the local duplicate 
removal phase. This phase forms a local database of tuples 
with unique attribute values originally available at the 
given node (that is, no two tuples have the same attribute 
value). For each unique datum value, two fields are 
maintained. One is a local repetition count representing the 
number of copies of the datum initially present at the node. 
The other is a global repetition count that is initialized to 
the corresponding local repetition count before the global 
duplicate removal stage. 

Since the local duplicate removal phase is identical in all 
three algorithms, we will not elaborate on it. We will 
briefly describe the global duplicate removal phase for each 
algorithm, though further details of the algorithms, 
including analytical models and performance evaluation, are 
beyond the scope of this article. 

Multiple-message ring. After the local duplicate removal 
phase, a unidirectional ring is embedded using Reflexive Gray 
Codes in the $\log_2 N$-dimensional hypercube. A packet 
consisting of tuples with locally unique attribute values and 
their local repetition counts is formed at each node. The 
global duplicate removal is achieved in N - 1 iterations. 
During each iteration, each node sends a packet to the next 
node, in a pipeline ordering of the Reflexive Gray Code, and 
receives one from the previous node. The received packet 
updates the local database as shown in Figure 6. This 
procedure repeats for N - 1 iterations, after which the 
following are true: 

• Attribute values are unique across node boundaries. 

• The node on which an attribute value is found is the node 
that had the largest local repetition count to start with. 
If more than one node had the largest local repetition count 
to start with, then the node with the largest node address 
retains that unique value. 

• The final local_global_count of an item is the global 
repetition count of that value. 
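
The ring algorithm can be simulated in a single process to check the three properties listed above. The sketch below is a minimal illustration, assuming node addresses 0 to N - 1 with N a power of two; the function names are hypothetical, the packets are forwarded unchanged around the ring, and real message passing is replaced by dictionary hand-offs.

```python
from collections import Counter


def gray_code_ring(n_nodes):
    """Node addresses in reflected Gray code order (n_nodes a power of two)."""
    return [i ^ (i >> 1) for i in range(n_nodes)]


def ring_duplicate_removal(node_data):
    """node_data[a] is the list of attribute values initially held by node a."""
    n = len(node_data)
    ring = gray_code_ring(n)
    local = [Counter(values) for values in node_data]      # local duplicate removal
    # local database: value -> [local_count, local_global_count]
    db = [{v: [c, c] for v, c in counts.items()} for counts in local]
    # each packet carries its origin address and the origin's local counts
    packets = {addr: (addr, dict(local[addr])) for addr in ring}
    for _ in range(n - 1):
        # every node passes the packet it holds to the next node on the ring
        received = {ring[(k + 1) % n]: packets[ring[k]] for k in range(n)}
        for addr, (origin, packet) in received.items():
            for value, packet_count in packet.items():
                if value not in db[addr]:
                    continue
                local_count = db[addr][value][0]
                # larger local count wins; ties go to the larger node address
                if (local_count, addr) > (packet_count, origin):
                    db[addr][value][1] += packet_count
                else:
                    del db[addr][value]
        packets = received
    return [{v: g for v, (_, g) in d.items()} for d in db]
```

For example, with four nodes holding [1, 1, 2], [1, 3], [2, 2, 4], and [1, 1, 4], value 1 ends up only on node 3 (tied with node 0 on local count, but with the larger address) with a global count of 5.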

Recursive reduction. The global duplicate removal phase is 
achieved by a $\log_2 N$-fold recursive reduction of the size 
of the cube containing the data, with the final results 
placed at a single target node. In each step j, the size of 
the active cube is halved based on the value of the jth bit 
in the node address. That is, all nodes with the jth bit 
equal to one belong to the first half, while the remaining 
nodes, those with the jth bit equal to zero, belong to the 
second half. 

For each node in one half, a corresponding node exists in 
the other half, where every bit except j is identical in 
value. Nodes that differ from the target node in the jth bit 
transmit local data to their corresponding node. The 
receiving nodes merge their respective local databases with 
the packets received, as shown in Figure 7. On termination, 
all the unique values and their global counts reside in the 
target node. 

If a datum in the packet is also present in the local database then 
    local_global_count = local_global_count + packet_count 
else 
    add the datum value to the local database at the appropriate location. 

Figure 7. Recursive reduction algorithm (local processing). 

Bucket. This algorithm is a special case of the hash join 
algorithm and can be used whenever the range of attribute 
values in the system is known a priori or can be readily and 
efficiently computed. A bucket consists of attribute values 
within a specified range. Each node is mapped to a bucket, 
and the attribute values residing on all other nodes 
belonging to that bucket are routed to that node. The 
routing is done in $\log_2 N$ steps by following an algorithm 
similar to the recursive reduction algorithm. During each 
step j, corresponding nodes partition the attribute value 
range into two halves. Nodes whose jth address bit equals 
zero keep all tuples whose attribute values map to the lower 
attribute value range and transmit the remaining tuples to 
their corresponding node. Nodes whose jth address bit equals 
one keep all tuples whose attribute values map to the upper 
attribute value range and transmit the remaining tuples to 
their corresponding node. Figure 8 outlines the pseudocode 
description of the bucket algorithm. 

Let N = number of nodes, k = log2 N, node address = «n_(k-1) n_(k-2) ... n_0» 
Let max_value and min_value define the range of attribute values to be sorted. 
Let recursion_no = 0; 
while (recursion_no < k) 
{ 
    create lower bucket consisting of attribute values such that 
        attribute value < min_value + (max_value - min_value) / 2 
    create upper bucket consisting of remaining attribute values. 
    Let X = «00 ... 010 ... 0» where the 1 is in the (recursion_no)th position. 
    Let partner_node = «n_(k-1) n_(k-2) ... n_0» BIT_WISE_EX_OR X 
    if (n_(recursion_no) = 0) 
    { 
        send upper bucket to the partner_node 
        receive bucket from partner_node and merge it with the lower bucket 
        max_value = min_value + (max_value - min_value) / 2 
    } 
    else 
    { 
        send lower bucket to the partner_node 
        receive bucket from partner_node and merge it with the upper bucket 
        min_value = min_value + (max_value - min_value) / 2 
    } 
    recursion_no = recursion_no + 1 
} 

Figure 8. Pseudocode of bucket algorithm. 

Communication requirements. We designed these three 
algorithms to exploit our knowledge of the iPSC/2’s 
message-passing system and to prevent simultaneous 
transmission to a node. In general, the internode 
communication needs of the algorithms are restricted to 
adjacent nodes. 

Unfortunately, the volume of data transferred between nodes 
is data dependent. Hence, it is difficult to maximize the 
sustained communication bandwidth between nodes by using 
large packets. When multiple packets are required to 
transfer the data between the nodes, however, we prefer 
fewer large packets to many small packets. 

Performance evaluation 

The time required for duplicate removal indicates the 
efficiency of the algorithm. The total time consists of 
three components: 

• local duplicate removal, 

• global duplicate removal, and 

• data collection. 

The ring and bucket algorithms distribute the resulting data 
across all the nodes. If the results must reside at a single 
node, they must be transmitted. The time needed to accom¬ 
plish this is called collection. 

Whether to include the collection time and the local 
duplicate removal phase, and which algorithm to select, are 
application dependent. For example, if the starting data are 
already locally unique, the local duplicate processing time 
need not be included. Similarly, if the final results are to 
be used as an input for further processing, the collection 
time should not be included in the execution time. Unless 
otherwise stated, we assume that the local duplicate removal 
time is included and the collection time is not included for 
computing the total execution time of an algorithm. 

We carried out experiments by varying the work load 
characteristics and the system configuration, namely the 
number of nodes. The cardinality, uniqueness, tuple size, 
distribution of the unique values, and distribution of data 
among the nodes define the work load characteristics. We 
define the uniqueness factor (or simply the uniqueness) of 
the data as the number of distinct attribute values expressed 
as a fraction of the total number of attribute values. In 
our experiments the attribute value of a tuple is taken as an 
integer. We refer to this integer value as the data value. 

As an example, a typical run may consist of eight nodes, 
8,000 data values, and a uniqueness of 0.4. The database 
randomly generates the required data and then distributes 
the data among the nodes, either uniformly (each node 
receives the same number of data values) or nonuniformly. 
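
The uniqueness factor and a test workload of the kind just described can be sketched as follows. This is an assumed generator written for illustration, not the one used in the experiments: it fixes the number of distinct integer values first and then fills the remainder with uniformly chosen repeats, and all function names are hypothetical.

```python
import random


def uniqueness(values):
    """Number of distinct values as a fraction of the total number of values."""
    return len(set(values)) / len(values)


def generate_data(total, target_uniqueness, rng):
    """Return `total` integers containing round(total * target_uniqueness) distinct values."""
    distinct = max(1, round(total * target_uniqueness))
    values = list(range(distinct))                      # every distinct value once
    values += [rng.randrange(distinct) for _ in range(total - distinct)]
    rng.shuffle(values)
    return values


def distribute_uniformly(values, n_nodes):
    """Give each node the same number of values (when n_nodes divides the total)."""
    return [values[i::n_nodes] for i in range(n_nodes)]
```

For the run quoted above, generate_data(8000, 0.4, random.Random(0)) yields 8,000 values of which 3,200 are distinct, and distribute_uniformly(..., 8) places 1,000 values on each node.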

We used each of the three algorithms for parallel duplicate 
removal and recorded the execution times. The execution 
times may include the local duplicate removal time and the 
collection time, as described earlier. Since the data are 
generated randomly, we expect the execution time of an 
algorithm to change every time the experiment is carried out. 
To account for the random behavior, we conducted the 
experiments multiple times and observed the deviation of the 
execution times of an algorithm. 

We found a small deviation in the execution times for all 
the runs except those with nonuniformly distributed data 
among the nodes. Therefore, for the data distributed 
uniformly among the nodes, we took the average of five runs 
as the true execution time of an algorithm. For the 
nonuniformly distributed data, we took the average of 25 runs 
of each algorithm as the execution time. 

Our objective was to evaluate the performance of each 
algorithm under different work load and system conditions 
with the eventual goal of dynamically selecting the algorithm 
of choice for each application. The independent variables were 

• the number of nodes, 

• the number of data elements, 

• the size of each element (or tuple), 

• the uniqueness of the data (degree of skew), 

• the initial distribution of the data, and 

• the desired final distribution of the data (distributed or 
centralized). 

Note that the choice of the optimum algorithm must be 
determined not only by the performance evaluation but also by 
the constraints on the initial data and final results. 

Figure 9a shows the execution times (including the local 
duplicate removal times) of the three algorithms and an 
optimal algorithm as a function of the uniqueness. The system 
consisted of 16 nodes and 16,000 data values. The optimal 
algorithm using N nodes requires (1/N)th the time taken by 
an optimal uniprocessor algorithm (a tree duplicate removal 
algorithm, in our case). 

Figure 9. Execution times including the local duplicate 
removal times: 16 nodes, 16,000 data values (a); 16 nodes, 
8,000 data values (b); 8 nodes, 16,000 data values (c); and 
8 nodes, 8,000 data values (d). 

Note that for the bucket algorithm the speedup in the 
execution time is about N/2 when N nodes are used. Moreover, 
the speedup is the same for all values of uniqueness. The 
speedups for the ring and recursive reduction algorithms are 
small for high uniqueness values, but they increase as the 
uniqueness decreases. 

Figures 9b, 9c, and 9d show similar plots for 16 nodes, 
8,000 data values; 8 nodes, 16,000 data values; and 8 nodes, 
8,000 data values. The bucket algorithm is insensitive to the 
uniqueness of the data, except for very low uniqueness 
values. The recursive reduction is the most sensitive, and 
the ring algorithm is moderately sensitive to the uniqueness. 

The bucket algorithm performs fastest for all except the 
very low values of uniqueness. For completely unique data 
(a uniqueness factor of 100 percent), the recursive reduction 
algorithm takes almost twice as long as the ring algorithm. 
But the performance of the recursive reduction significantly 
improves as the uniqueness decreases, due to its sensitivity 
to the uniqueness of data. At lower levels of uniqueness, the 
recursive reduction actually performs better than the ring. 
This threshold value of the uniqueness changes with the 
system and data conditions, namely the number of nodes and 
the number of data values. For example, for 16 nodes and 
16,000 data values, the threshold uniqueness is 0.28, whereas 
for 16 nodes and 8,000 data values, it is 0.31. 

Three components of the execution time explain sensitiv¬ 
ity towards uniqueness. 

• Local duplicate removal. Consider a database of size M 
with a global uniqueness p. Divide the database into N parts 
and let the local uniqueness of each part be q. We see that, 
in general, q will be larger than p. In fact, for 
$p \gg 1/N$, $q \approx 1$. For example, for M = 16,000 and 
N = 16, if p is reduced from 1.0 to 0.1, q remains at 1.0. 
Consequently, the local duplicate removal time is not 
sensitive to the global uniqueness p, especially for large 
values of p. 

• Global duplicate removal. This component consists of two 
parts: 

Data transfer. In the bucket and ring algorithms, the local 
uniqueness factors determine the size of packets to be 
transferred. For both algorithms, the packet size is not 
affected by change in the global uniqueness p. On the other 
hand, in recursive reduction, the size of the cube decreases 
as the recursion progresses and the local uniqueness changes 
from 1.0 (in the first recursion) to p (in the last recursion 
step). As a result, the size of the packets to be 
transferred increases from the first to the last recursion 
and is sensitive to the global uniqueness. 

Local processing. For the bucket algorithm, the size of the 
received packet and the size of the local database are 
unaffected by the change in the global uniqueness. As a 
result, the local processing time of the bucket algorithm is 
insensitive to global uniqueness. For the ring algorithm, 
the size of the received packets and the initial size of the 
local database do not depend upon the global uniqueness. As 
the iterations progress, however, the local database shrinks. 
The rate of this decrease depends on the global uniqueness. 
The lower the global uniqueness, the faster the local 
database shrinks. Hence, the local processing time is 
moderately sensitive to the change in the global uniqueness. 
For the recursive reduction algorithm, as explained earlier, 
both the size of the received packets and the size of the 
local database are sensitive to the global uniqueness. 
Consequently, the local processing time of the recursive 
reduction is highly sensitive to the global uniqueness. 
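
The growth of the recursive reduction packets from step to step can be observed in a small single-process simulation. The sketch below is an illustration under stated assumptions: nodes are list indices 0 to N - 1 with N a power of two, the merge rule follows Figure 7, and recursive_reduction and its packet_sizes output are hypothetical names introduced here.

```python
from collections import Counter


def recursive_reduction(node_data, target=0):
    """Merge all values into `target`; also report the largest packet of each step."""
    n = len(node_data)
    db = [Counter(values) for values in node_data]    # local duplicate removal
    active = set(range(n))
    packet_sizes = []
    bit = 1
    while bit < n:
        sizes = []
        for addr in sorted(active):
            if (addr ^ target) & bit:                 # differs from the target in this bit
                partner = addr ^ bit
                sizes.append(len(db[addr]))           # distinct values in the packet
                db[partner].update(db[addr])          # add counts, insert new values
                db[addr] = Counter()
        # only nodes that agree with the target in this bit stay active
        active = {a for a in active if not (a ^ target) & bit}
        packet_sizes.append(max(sizes))
        bit <<= 1
    return dict(db[target]), packet_sizes
```

With four nodes holding [1, 1, 2], [1, 3], [2, 2, 4], and [1, 1, 4], the largest packet holds two distinct values in the first step and three in the second, and the target ends with the global counts of all four values.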

Depending on the particular application at hand, the local 
duplicate removal time may or may not be important. If this 
time is not taken into account, the execution times of the 
three algorithms decrease by a fixed amount. Since the local 
duplicate removal time changes with uniqueness, the decrease 
in the execution time, though exactly the same for all three 
algorithms, differs for each uniqueness. Even if we do not 
consider the local duplicate removal time, the threshold 
value of the uniqueness, below which the recursive reduction 
algorithm performs better than the ring algorithm, does not 
change. 

The performance of the three algorithms for very low 
uniqueness values, shown in Figure 9, is of interest. Figure 10 
shows execution times of the three algorithms for uniqueness 
varying from 0.001 to 0.01. Note that the ring algorithm is the 
most time-intensive of the three. The recursive reduction per¬ 
forms better than the bucket for most uniqueness values. 

In the ring and the bucket algorithms, the final results are 
distributed among the different nodes. Such a distribution 
may be desired in various situations, including intermediate 
relations within the execution of a query. In many 
applications, however, the final results are needed in one 
node. In such cases, for the ring and bucket algorithms the 
distributed results must be collected and merged at one node. 
Note that the recursive reduction algorithm already includes 
the collection time. 

Figure 11 shows the execution times including collection. 
The ring algorithm performs slowest. The bucket algorithm’s 
performance degrades but remains better than that of the 
recursive reduction algorithm, especially for high uniqueness 
factors. For uniqueness values below a threshold, recursive 
reduction is better than bucket. The value of this uniqueness 
threshold depends on the number of nodes and the data size. 

We have assumed that the data with particular uniqueness 
are generated randomly and that the distribution of different 
unique values in the data is uniform. This assumption may 
not always be true. We studied the effect of a skewed 
distribution of the unique data values on the algorithms by 
generating the data values using a nonuniform distribution. 
Figure 12 shows the execution times of the three algorithms 
when we generated data using a Gaussian distribution with μ 
equaling the mean of the unique values to be generated and σ 
equaling 50. 

Figure 10. Execution times for low uniqueness, including 
the local duplicate removal times: 16 nodes, 16,000 data 
values (a); 16 nodes, 8,000 data values (b); 8 nodes, 16,000 
data values (c); and 8 nodes, 8,000 data values (d). 

Figure 11. Execution times including collection: 16 nodes, 
16,000 data values (a); 16 nodes, 8,000 data values (b); 8 
nodes, 16,000 data values (c); and 8 nodes, 8,000 data 
values (d). 

Figure 12. Execution times for nonuniform tuple counts: 
16 nodes, 16,000 data values (a); 16 nodes, 8,000 data 
values (b); 8 nodes, 16,000 data values (c); and 8 nodes, 
8,000 data values (d). 

As expected, the execution times of all three algorithms 
remain unchanged for high uniqueness. For low uniqueness, 
the execution times of the ring and the recursive reduction 
algorithms decrease. This is not surprising because, if the 
distribution of the unique values is skewed, the local 
uniqueness factor on each node is generally smaller than it 
would be if the unique values were not skewed. Smaller local 
uniqueness implies that the local as well as the global 
duplicate removal times are smaller. 

The performance of the ring and recursive reduction 
algorithms for the uniformly generated data is the worst-case 
performance for a specified uniqueness as far as the skewness 
of the unique values is concerned. In the case of the bucket 
algorithm, the local duplicate removal time decreases, as it 
does for the ring and recursive reduction algorithms. 

The global duplicate removal time increases because the
design of the bucket algorithm assumes a uniform distribution
of data. If the distribution of the data is skewed, the size of each
bucket will differ. As a result, after a few iterations some nodes
will be processing and transferring large amounts of data,
whereas others may be operating on small amounts of data.
The discrepancies among the volumes of data on different
nodes increase with each iteration. Hence, in the bucket
algorithm, the load-sharing among the nodes is not uniform if the
data are not generated uniformly. This increases the global
duplicate removal time of the bucket algorithm.
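
The load imbalance described above can be illustrated with a small sketch. It is not the bucket algorithm itself: it assumes a simplified stand-in in which each of 16 nodes owns one equal-width range of values, and the value range 0 to 1,000 is a hypothetical choice. Only the node count, the 16,000 data values, and the Gaussian with a standard deviation of 50 come from the text.

```python
import random


def bucket_loads(values, num_buckets, lo, hi):
    """Count how many values fall in each of num_buckets equal-width ranges."""
    width = (hi - lo) / num_buckets
    loads = [0] * num_buckets
    for v in values:
        # Values equal to hi land in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        loads[idx] += 1
    return loads


rng = random.Random(0)
LO, HI = 0, 1000            # hypothetical value range
COUNT, NODES = 16000, 16    # 16,000 data values on 16 nodes

uniform = [rng.uniform(LO, HI) for _ in range(COUNT)]
# Gaussian centred on the middle of the range, sigma = 50, clamped to the range.
skewed = [min(max(rng.gauss((LO + HI) / 2, 50), LO), HI) for _ in range(COUNT)]

uniform_loads = bucket_loads(uniform, NODES, LO, HI)
skewed_loads = bucket_loads(skewed, NODES, LO, HI)
```

Under these assumptions the uniform data leave every node with roughly 1,000 values, while the skewed data pile most values onto the few nodes whose ranges sit near the mean.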

The total execution time will either decrease or increase 
depending on the sum total of the changes in the local and 
global duplicate removal times. Comparing Figures 9 and 12, 
we see that the total execution time remains almost constant 
for the bucket algorithm. 

We assumed the distribution of the initial data among the
nodes was uniform. This, too, may not always be the case. To
study the effect of nonuniformly distributed data on the three
algorithms, we uniformly generated data but distributed it
nonuniformly to the nodes as follows:

• Generate a random number between 0 and M (the number of data
values) using a uniform random number generator.

• Let the number be x_0.

• Randomly select x_0 data values and assign them to node 0.

• Now generate a random number between 0 and M - x_0
using a uniform random number generator.

• Let that number be x_1.

• Randomly select x_1 data values and assign them to node 1.

• Repeat this procedure to assign the data to the remaining
nodes.

• Assign to the last node all data not assigned to any other
node.
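
The procedure above can be sketched in a few lines. This is an illustration rather than the authors' code: the function name is hypothetical, and it assumes that each draw is bounded by the number of values not yet assigned, which is one reading of the bounds in the list.

```python
import random


def distribute_nonuniformly(values, num_nodes, rng=None):
    """Split values across num_nodes nodes with random, uneven counts."""
    if rng is None:
        rng = random.Random()
    remaining = list(values)
    rng.shuffle(remaining)  # taking a prefix of a shuffled pool = random selection
    nodes = []
    for _ in range(num_nodes - 1):
        # Draw a count between 0 and the number of values still unassigned.
        count = rng.randint(0, len(remaining))
        nodes.append(remaining[:count])
        remaining = remaining[count:]
    nodes.append(remaining)  # the last node receives everything left over
    return nodes


nodes = distribute_nonuniformly(range(16000), 16, random.Random(1))
```

Because early nodes can claim most of the pool, later nodes often receive very little, which is the kind of imbalance the experiment is meant to produce.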

As can be expected, the execution times of the three algorithms
change substantially every time new data are generated using
this procedure. Therefore, we repeated the procedure 25 times
instead of the normal five times. Figure 13 shows the results
on the nonuniformly distributed data. Note that the execution
times of all three algorithms increase by a factor of 2 to 3. However,
the bucket algorithm remains the algorithm of choice.

Another important issue related to the initial data is the tuple
size. So far we have assumed that the data consist of only the
key, which is assumed to be an integer value. In reality, a tuple
consists of a key along with other fields of varying sizes. In
such cases, the duplicates are removed by appending the other
fields of a duplicated tuple. For example, assume that a tuple
consists of a key and a 10-byte field. If the data contain 50
duplicates of a key value, after the duplicates are removed, the
key value will occur only once along with 50 fields of 10 bytes
each.
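
A minimal sketch of this form of duplicate removal follows, using the 50-duplicate, 10-byte example from the text. The function name and the dictionary representation are assumptions made for illustration; the article does not describe its data layout.

```python
from collections import defaultdict


def remove_duplicates(tuples):
    """Collapse (key, field) tuples so each key appears once with all its fields."""
    merged = defaultdict(list)
    for key, field in tuples:
        merged[key].append(field)  # append the field instead of discarding it
    return dict(merged)


# 50 duplicates of key 7, each carrying a 10-byte field.
data = [(7, bytes(10)) for _ in range(50)]
result = remove_duplicates(data)
```

The key survives once, but the 500 bytes of field data still have to be stored and moved, which is why larger tuples raise the cost of every algorithm.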

Figure 14 on page 54 shows the execution times of the three
algorithms for varying tuple sizes. Note that the execution times
of all three algorithms increase by an order of magnitude. The
recursive reduction is the most sensitive to the tuple size and
clearly performs slowest for large tuple sizes. The sensitivity
of the recursive reduction algorithm to the tuple size is
consistent with its sensitivity to the uniqueness p and is
explained by examining the three components of the execution
times, as discussed earlier. Table 2 on page 55 summarizes
our results.

THE EFFICIENT EXPLOITATION OF PARALLELISM in a
distributed memory architecture necessitates the development
of algorithms designed for the underlying execution environment.
Proper algorithmic design accounts for both the
computational and communicational demands of the application.

Our study focused on algorithms designed for duplicate
removal on a hypercube database system. We proposed,
implemented, and evaluated three algorithms. We then used
the results obtained by a performance evaluation of these
algorithms under varying experimental parameters to
determine the optimality of a given algorithm for a particular task.

Modeling the three algorithms and predicting their
performance for a given system and load characteristics simplifies
the determination of the algorithm of choice. Another
advantage of using the analytical models is that the results can
be generalized to different-size iPSC/2s, as well as to other
types of hypercubes, provided certain system parameters
(namely, the processing and the communication speeds) are
known a priori.






Figure 13. Execution times for data distributed nonuniformly
among the nodes: 16 nodes, 16,000 data values (a);
16 nodes, 8,000 data values (b); 8 nodes, 16,000 data values (c);
and 8 nodes, 8,000 data values (d).


Figure 14. Execution times for different tuple sizes (16 
nodes, 16,000 data values): 4-byte tuples (a); 14-byte 
tuples (b); 24-byte tuples (c); 104-byte tuples (d). 


The distributed database is a special case of a multicomputer
in which the communication bandwidth is lower than that of a tightly
coupled multicomputer. Therefore, the analytical models of
the three duplicate removal algorithms can be generalized to
distributed database design as well.

We plan to design a dynamic query optimization scheme
that incorporates the results of our study in determining how to
best process a given relational operator within a given query.
By examining the degree of uniqueness within the input
relations and selecting the corresponding optimal duplicate
removal algorithm, we intend to reduce the total query
processing time. The cost of determining the optimal algorithm
is an important consideration that needs to be addressed.
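
As a rough sketch of how such a selection step might look, the crossover values from the first constraint row of Table 2 (initial data not unique, final results may be distributed) can be placed in a lookup table. The function name and structure are hypothetical, and the table gives no choice when p exactly equals a crossover value, so assigning that case to recursive reduction is an assumption.

```python
# Crossover uniqueness values copied from Table 2, first constraint row.
# Key: (nodes N, data values M). Value: (range known, range not known).
THRESHOLDS = {
    (16, 16000): (0.09, 0.34),
    (16, 8000): (0.04, 0.32),
    (8, 16000): (0.02, 0.25),
    (8, 8000): (0.02, 0.30),
}


def algorithm_of_choice(n, m, p, range_known):
    """Pick an algorithm from global uniqueness p for one measured configuration."""
    known_cut, unknown_cut = THRESHOLDS[(n, m)]
    if range_known:
        # Bucket needs the range of data values to be known in advance.
        return "bucket" if p > known_cut else "recursive reduction"
    return "ring" if p > unknown_cut else "recursive reduction"
```

A lookup of this kind is cheap at query time, but it only covers the four measured configurations. Other sizes would need the analytical models mentioned earlier.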
Table 2. Summary of the algorithm of choice.

Range of    N=16,                N=16,                N=8,                 N=8,
data values M=16,000             M=8,000              M=16,000             M=8,000

Constraints: Initial data not unique and final results may be distributed
Known       Bucket for p > 0.09  Bucket for p > 0.04  Bucket for p > 0.02  Bucket for p > 0.02
            RR* for p < 0.09     RR for p < 0.04      RR for p < 0.02      RR for p < 0.02
Not known   Ring for p > 0.34    Ring for p > 0.32    Ring for p > 0.25    Ring for p > 0.3
            RR for p < 0.34      RR for p < 0.32      RR for p < 0.25      RR for p < 0.3

Constraints: Initial data unique and sorted, and final results may be distributed
Known       Bucket for p > 0.1   Bucket for p > 0.04  Bucket for p > 0.02  Bucket for p > 0.02
            RR for p < 0.1       RR for p < 0.04      RR for p < 0.02      RR for p < 0.02
Not known   Ring for p > 0.4     Ring for p > 0.3     Ring for p > 0.25    Ring for p > 0.3
            RR for p < 0.4       RR for p < 0.3       RR for p < 0.25      RR for p < 0.3

Constraints: Initial data not unique and final results should not be distributed
Known       Bucket for p > 0.1   Bucket for p > 0.12  Bucket for p > 0.13  Bucket for p > 0.13
            RR for p < 0.1       RR for p < 0.12      RR for p < 0.13      RR for p < 0.13
Not known   RR                   RR                   RR                   RR

Constraints: Initial data not unique and final results may be distributed. Tuple counts not uniform
Known       Bucket               Bucket               Bucket               Bucket
Not known   Ring for p > 0.15    Ring for p > 0.25    Ring for p > 0.08    Ring for p > 0.15
            RR for p < 0.15      RR for p < 0.25      RR for p < 0.08      RR for p < 0.15

Constraints: Initial data not unique and final results may be distributed. Data not distributed uniformly on the nodes
Known       Bucket               Bucket               Bucket               Bucket
Not known   RR                   RR                   RR                   RR

Constraints: Initial data not unique and final results may be distributed. Tuple size=29 bytes
Known       Bucket               Bucket               Bucket               Bucket
Not known   Ring                 Ring                 Ring                 Ring

* RR: recursive reduction. p: global uniqueness.

Conformance Testing of VMEbus and Multibus II Products


Designers working with the European Community’s Conformance Testing Services program
have developed a system for testing VMEbus and Multibus II products. Conformance is necessary
to allow and guarantee interoperability between products from different vendors. The
EC’s test system is mainly automated to reduce costs and ensure impartiality. An overview of
the system’s components familiarizes readers with its procedures.

Conformance testing seeks to determine
if a product or parts of a product (in
our case, parallel bus interfaces) has
been built according to the rules of
the applied standard. One way to check conformity
is to examine the schematics, programmable
logic array data, and PROM data, and compare
the results with the rules of the standard. Another
way is to create a testing environment representing
the standard and test the product against
it. Dedicated devices of the test environment
monitor the results of interaction.

This second way requires a big investment in
methodology, hardware, and software before the
first product can be tested. But this approach's
advantage is that it produces test results that are
objective, reproducible, and independent of
the test engineer. A consortium of advisors chose
the second method for the Bus Interface Conformance
Test (BICT) project.


Test system 

The BICT system tests the conformity to standards
of bus interfaces of VMEbus and Multibus
II boards and their backplanes. The test system
represents the standards and offers all the options
the standards give. This approach allows a
board to be plugged into the test system where
predefined automatic or semiautomatic procedures
test the features of the unit under test.


The bus standards specify four main categories:

• timing and logical behavior of signals,

• message-passing protocol (for Multibus II
only),

• electrical properties of boards and backplanes,
and

• mechanical dimensions of boards and
backplanes.

A specific configuration of the test system and its
devices is required for each category. Figure 1
on page 58 shows the general configuration of
the BICT system, including the dial bench gauge,
which is operated manually.

A test system controller manages the activities
of the test system devices and gives instructions
to the operator, or test officer. For each test campaign
the controller collects the measuring results,
stores them in a database, and generates a
test report on the basis of collected data. The
controller performs a series of functions (see
box on page 58), classified by type of action.

In this article we explain a test campaign in
the sense of a walk through the test system, in
the way a customer (such as a manufacturer,
reseller, or original equipment manufacturer)
would see it. We focus on testing bus interfaces
because it is more complex than testing
backplanes.
Figure 1. BICT system. (Labels legible in the figure: BICT; IEEE-488 bus;
dial bench gauge, operated manually.)

Test system controller functions 

For the classifications listed here, the controller provides the following information or performs the following tasks. In
the case of manual tests, the operator provides the information, which the controller then analyzes.


The test officer 

• Information about the actual test and test state 

• Instructions about mechanical tests to be carried 
through by the officer 

• Information about test results by immediate check 
against parts of the relevant standard 

• Requests test results and offers data entry capabilities 

Test system controller 

• Start/stop logic analyzer 

• Start/stop pattern generator 

• Poll logic analyzer and digital storage oscilloscope on 
trigger or time-out if no triggering 

• Select and start test routine of the unit’s upper tester (via 
communication channel) 

• Upload data from test devices 

• Download data to test devices 

Digital storage oscilloscope 

• Measurement of signal edges and waveforms for 
electrical test 

• Measurement of voltage levels for static electrical test 


Logic analyzer 

• Run the trigger to monitor the logical and timing behavior
of the unit

• Store captured data in case of triggering 

• Access to database of trigger setups 

• Trigger on certain messages or bus states 

• Store captured message in case of triggering 

Pattern generator 

• Run sequence of patterns for bus operation initiation/ 
response 

• Control of delay lines for signal adjustment 

Power supply 

• Provide unit with current in the specified voltage range 

• Measurement of current consumed by unit 

Mechanical test devices (dial bench gauge, slide 
calipers) 

• Measurement of boards 

• Measurement of devices mounted on boards 
BICS/BIXIT

Before testing begins, a customer provides the following
information about the unit on two DOS-based, electronic
forms:

• Bus interface conformance statement (BICS). The
customer enters the capabilities and options of the standard
implemented in the unit.

• Bus interface extra information for testing (BIXIT).
The customer provides further details on the implementation
features of the unit, such as address ranges, interrupt
levels, bus request levels, and message types.

BICS/BIXIT is a menu-driven, interactive input program that
automatically checks for completeness and consistency as
the user fills in the form. It features dynamic data management
and automatic data interfacing and recording. Figure 2
depicts a BICS/BIXIT input screen requesting information
about the implemented features of a memory slave.

Chip-level testing is complex, time-consuming, and not
cost-effective for the purpose of conformance testing. Board
manufacturers use relatively few standardized silicon devices to
implement the interface drivers, and sufficient and reliable
information about their capabilities is available. We can
therefore avoid testing on the chip level.

In lieu of such testing, the board manufacturer provides 
information about drivers, transceivers, and other devices that 
implement the bus interface, on another DOS-based form, 
the vendor claim. The vendor returns this form, along with 
the BICS and BIXIT data, to the center, where it is analyzed 
by a test program running on the controller. 

In the first step of testing, static conformance review, the
controller checks this information against a silicon database
and the relevant rules of the standard.

Figure 2. Input screen of BICS/BIXIT. (Menu bar legible in the
screen: File, Edit, View, Check, Document, Quit.)

Mechanical test

Test officers manually assist in the next step, testing whether
the board fits into a standard rack and its backplane. A
mechanical test program running on the controller guides the
tester and indicates the points to be measured. The mechanical
test is based on two test methods: the outline test with the
dial bench gauge and the rack fit test with the rack simulation.

For the outline test, the dial bench gauge (see Figure 1) can be
configured according to the controller’s instructions, and the
test officer mounts the unit board onto the gauge. The tester
then applies the test tool to the measuring points asked for by
the test program. The values read are transmitted via the
IEEE 488 bus to the controller. The program determines if the
read value is plausible and compares it with the requirements
of the standard.

The dial bench gauge consists of a base plate on which
smaller plates are mounted, carrying a reference bolt for each
position to be measured. The positions of these reference
bolts vary for different board sizes (single, double, and triple
European cards, 160 or 220 mm long).

The tester fixes the unit on two clamping strips. Before
taking the measurements, the tester samples a reference board
to get the offset values. This procedure, called the calibration
phase, delivers the offset factors, which are stored in the
controller and integrated into the measurement calculation.
Figure 3 on page 60 shows the configuration for the
mechanical test.

Testers use a slide caliper for special cases where the depth
of the gauge is not applicable. The rack fit test is a purely
qualitative test, as postulated in the VME standards. The tester
performs this test manually.

Electrical test

The electrical test verifies rules concerning requirements
for electrical characteristics and behavior. The measurements
are electrical, chronological, or a mixture of the two.

One part of the electrical test measures electrical
characteristics independent of special capabilities of the unit, such
as connectivity, termination networks, and power consumption.
The other part times signals or performs specific functions
of the unit, measuring signal waveforms (rise and fall
times, high and low times), signal levels, and power on/off
behavior.

As with the mechanical test, the controller guides the operator
through the electrical tests (see Figure 4). The test
officers handle the test probes manually, since for each rule
different test points must be checked. The obtained results
are uploaded to the controller via an IEEE 488 interface, where
a program compares them with the specification values and
generates a statement for the report.

The electrical test does not measure driving and receiving
characteristics of the signal lines. Testing them would mean
testing silicon and, as mentioned earlier, this exceeds the
scope of conformance testing on the board level. Therefore,
the test officers refer to the information provided on the vendor
claim form to check these devices or assemblies. Operators
apply this method when items are not testable or may
change from board to board. Items checked in this way include
the drivers, receivers, and connectors used.

Timing protocol test 

The timing protocol test determines whether the interface
modules of the unit communicate through the bus with other
modules in accordance with the relevant timing protocol given
by the test specification. (Interface modules include the
interrupt handler, interrupt requester, and arbiter.) The system
individually tests every module or joined modules of the unit.


A pattern generator stimulates the module(s) under test
through the bus, while a logic analyzer detects anomalous (or
illegal) events on the bus (see Figure 5). The stimulating patterns
created on the bus by the pattern generator, as well as
the responses created on the bus by the module(s) under
test, constitute a bus operation.

The standard rules define the modules and their options,
or capabilities. Modules can participate in different bus cycles
according to their defined capabilities. For example, a VME
master module with 16-bit data transfer capabilities may
participate in single- and double-byte read-and-write data transfers,
as well as block transfers. But it may not participate in
Byte (0-3) data transfers.

A module option may also define a cycle option. A VME
arbiter module may resolve a request according to a
standard-specific algorithm (single, priority, or round robin) and
a VME requester may release the bus according to one of
two algorithms (release-when-done or release-on-request).

Bus operation. Every bus operation is specified as a set
of bus cycles and hence as a set of timing and data protocol
test rules. Part of the bus operation stimulates the test module.
The other part is the response of the unit, which will be
analyzed by the logic analyzer.

For example, a CPU module reads data and writes it to a
memory board. First it runs an arbitration cycle to gain the
bus. Next it runs a read cycle to fetch the data from the
module. To write its data to the module, it performs a write
cycle. Finally, the CPU module releases the bus. In this example
the bus operation consists of the arbitration, read, and
write-and-release cycles. In every cycle the CPU module
(stimulus) gives signals that are driven by the memory module
(response).

Figure 3. Mechanical test.

Figure 4. Electrical test.

Figure 5. Timing protocol test: test of masters.
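
One way to picture a bus operation as "a set of bus cycles and hence a set of test rules" is the small data structure below. It is a sketch only: the class names are hypothetical, and the rule identifiers are invented placeholders, not rule numbers from any VMEbus or Multibus II specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BusCycle:
    name: str
    rules: tuple  # placeholder identifiers of the test rules governing this cycle


@dataclass
class BusOperation:
    cycles: list = field(default_factory=list)

    def rules(self):
        """All rules exercised by the operation, in cycle order, without repeats."""
        seen = []
        for cycle in self.cycles:
            for rule in cycle.rules:
                if rule not in seen:
                    seen.append(rule)
        return seen


# The CPU-to-memory example: arbitration, read, then write-and-release.
copy_operation = BusOperation([
    BusCycle("arbitration", ("T-ARB-1",)),
    BusCycle("read", ("T-DTB-1", "D-DTB-1")),
    BusCycle("write-and-release", ("T-DTB-1", "T-REL-1")),
])
```

Collecting the rules per operation shows which parts of the standard a given stimulus sequence can actually exercise.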

Pattern generation. The signal relations, into which timing
rules are translated, define the limits of the stimulating
rules. For worst-case timing protocol testing, we developed
worst-case stimulating patterns. For basic interconnection
timing protocol testing, the system stimulates the unit using more
tolerant limits.

A pattern generation program reads signal relations associated
with a bus cycle as input information and translates
them to stimulating patterns and wait statements for the pattern
generator. Data protocol test rules are taken into account
by the pattern generation program, as needed.

Real-time analysis. The real-time analysis method tests
behavioral characteristics (or functional tests) and verifies
the relevant rules. Real-time analysis facilitates exhaustive
testing and supports the BICT philosophy:


Many complete bus cycles/operations are performed consecutively.
The logic analyzer monitors the signals on the
bus in real time and only freezes data in its memory if a
trigger condition is fulfilled. While cycles run, the controller
downloads different trigger words to the logic
analyzer.

We decided to run the bus cycles/operations several times
because 1) asynchronous behavior leads to deviations in reaction
times, and 2) different states of the unit may cause
different reactions. This method uses anomaly triggers, so
that only in cases where the standard is violated does the
logic analyzer upload the data to the controller for the test
report.
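
The idea of anomaly triggers can be sketched in software, although the real system does this in logic analyzer hardware. Everything named here is an assumption made for illustration: the function, the sample format, the simulated delays, and the 50 ns limit are not taken from the test specification.

```python
def run_with_anomaly_triggers(cycles, triggers, repetitions):
    """Replay the cycles several times and keep only samples that hit a trigger."""
    captured = []
    for run in range(repetitions):
        for sample in cycles(run):
            for name, violates in triggers.items():
                if violates(sample):
                    # Only violations are stored, mirroring "freeze on trigger".
                    captured.append((run, name, sample))
    return captured


def simulated_cycles(run):
    # Response delay (ns) differs from run to run, as asynchronous logic would.
    delay = (30, 35, 60)[run % 3]
    yield {"cycle": "read", "response_delay_ns": delay}


# Hypothetical limit: a response slower than 50 ns counts as a violation.
triggers = {"response too slow": lambda s: s["response_delay_ns"] > 50}
violations = run_with_anomaly_triggers(simulated_cycles, triggers, repetitions=6)
```

A single run would have missed the slow response here, which is the reason the text gives for repeating the cycles.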

Control-and-status register test 

The control-and-status register set holds system information
in a standardized way. During a system startup, the configuration
program uses it for systemwide initialization,
configuration, and diagnostics.

The test for Multibus II checks the content, structure, and
predefined functions of the control-and-status register. Since
almost every item tested is a different case, we established an
individual test procedure for each rule. During the test these
procedures are executed one by one, according to the unit’s
capabilities.

Message-passing protocol test 

The message-passing protocol describes a communication
and data exchange procedure between agents using
well-defined messages. We call it “network in a crate” because it
applies mechanisms similar to those used in LANs. One message
packet is transferred using one sequential data transfer
bus operation in the message address space. In general, we
define two types of messages:

• Unsolicited. An unsolicited message is a one-way transfer
of a data packet to a destination address. The packet
carries up to 28 bytes of data. Systems use unsolicited
messages for interprocessor communication, very short
data transfers, and virtual interrupts. Within one Multibus
II system, we can define up to 255 message (interrupt)
sources and 255 message (interrupt) destinations.
The broadcast message, a special unsolicited message,
is a simultaneous, one-way transfer of an unsolicited
message to all available destination addresses in a
system.

• Solicited. A solicited message transfer is a specified, finite
exchange of data packets that transfers up to 16
Mbytes of data. Before such a message is sent, the system
negotiates to ensure there is enough free data buffer
available at the destination and to define other transfer
parameters. The solicited message fragments into small
data packets, each containing at least 32 bytes. The system
then sends them over the bus, one at a time.

The message-passing protocol test for Multibus II checks
that the unit performs the sending and receiving of unsolicited
and solicited messages correctly. For solicited messages, it
also tests the negotiation of the transfer and the correct application
of the negotiated parameters. In addition, the content
of each message is checked.

Since the message-passing protocol uses a bus protocol on 
signal level, the unit must pass the electrical and functional 
test (on signal level) successfully before operators perform 
the message-passing protocol test. 

To perform this test, the vendor installs an upper tester,
which substitutes for the software layer above the unit. Figure 6
shows that a test routine runs through a communication
channel between the controller and the upper tester.



Figure 6. Message-passing protocol test. 


BICT divides the message-passing test into two entities, 
each applying different analysis methods. 

• Messages sent. The controller uses two methods to analyze
outgoing messages. In most cases, the system uses
a software analysis method. The Multibus II interface
receives the unit’s messages, which are then analyzed
by the controller. But in some cases, in which the logic
analyzer must monitor the bus, the system uses a bus
analysis method. This method measures, for example,
the duty cycle during a solicited message transfer, or
detects unexpected messages sent by the unit.

• Messages received. Two methods can be used to test the
message-receiving capability of a unit. In the on-board
message verification method the upper tester analyzes
the messages received by the unit and reports the test
result to the controller. Since this enlarges the upper
tester and with it the vendor’s investment in installing it,
we prefer the message reflection method. Here, the messages
sent to the unit are reflected by the unit and analyzed
by the test system.

To ensure that detected errors are caused by the unit's receiving
capability and not by its sending capability, the unit
must pass the sending-capability test before the test officer
applies the message reflection method.

Test example 

To demonstrate how the conformance test is carried out in
practice, we describe here a message-passing protocol test
case for Multibus II. Rule 13.5.1.2, Section 5 of the Multibus II
test specification states, “If an agent is not able to receive a
requested solicited message, it must send a buffer reject message
to the agent that has sent the buffer request message.”
This rule checks the negotiation of a solicited-message transfer.
The buffer reject message is an unsolicited message that
sets up a solicited-message transfer. A system refuses the
buffer request if, for example, a receiver does not have enough
resources or the resources are, at that moment, not available.

Verification procedure. To test this rule, we apply the
software analysis method. At the beginning of the verification
procedure the controller, via the communication channel,
starts a test routine of the upper tester installed on the
unit. The selected test routine is responsible for receiving
solicited messages.

Then the controller, via the Multibus II interface, sends a
buffer request message to the unit (see Figure 7). In this case
the requested length of the solicited message, 10 Mbytes, is
greater than the available local resources of the unit, 1 Mbyte.
The size of the unit’s local resource has already been declared
on the BICS/BIXIT form, so the controller can automatically
calculate the requested length of the message.

Now the controller waits until the Multibus II interface receives
a message. If no message arrives
within a specific time, the test of the rule
has failed. If a message arrives, the controller
verifies whether the message is a
buffer reject.

Results. At the end of the verification
procedure the controller interprets the
results and appends them to a log file.
In the case of rule 13.5.1.2 section 5, it
assigns a “failed” result if the unit sends
no message or a message that is not a
buffer reject.
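
The verification procedure for this rule can be sketched as follows. The sketch replaces the real hardware with a simulated unit, so every name in it (the function, the class, the message constants, the one-second timeout) is a hypothetical stand-in. Only the logic comes from the text: request more than the unit can hold, then pass the rule only if a buffer reject comes back.

```python
BUFFER_REQUEST = "buffer request"
BUFFER_GRANT = "buffer grant"
BUFFER_REJECT = "buffer reject"
MBYTE = 1024 * 1024


def verify_buffer_reject(send_request, receive, unit_buffer_bytes, timeout_s=1.0):
    """Return 'passed' only if an oversized request is answered with a reject."""
    requested = 10 * unit_buffer_bytes   # e.g. 10 Mbytes against a 1-Mbyte unit
    send_request(BUFFER_REQUEST, requested)
    reply = receive(timeout_s)           # None models "no message within the time"
    return "passed" if reply == BUFFER_REJECT else "failed"


class SimulatedUnit:
    """Illustrative stand-in for the unit and the Multibus II interface."""

    def __init__(self, buffer_bytes, reply_when_too_large):
        self.buffer_bytes = buffer_bytes
        self.reply_when_too_large = reply_when_too_large
        self.pending = None

    def send_request(self, kind, length):
        if kind == BUFFER_REQUEST:
            fits = length <= self.buffer_bytes
            self.pending = BUFFER_GRANT if fits else self.reply_when_too_large

    def receive(self, timeout_s):
        reply, self.pending = self.pending, None
        return reply


conforming = SimulatedUnit(MBYTE, BUFFER_REJECT)
faulty = SimulatedUnit(MBYTE, BUFFER_GRANT)   # grants a request it cannot satisfy
silent = SimulatedUnit(MBYTE, None)           # never answers
```

Deriving the requested length from the declared buffer size mirrors the way the controller uses the BICS/BIXIT data. The factor of ten simply reproduces the 10-Mbyte request against a 1-Mbyte unit in the example.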

Test report. After the test, the center
prepares a report (see Figure 8) and
sends it to the customer. The report describes
the unit, the test suite, and all
useful information concerning the executed
tests and their results. The report’s
first section summarizes the number of
tests applied and failed. A later section
details the applied tests and the test sequence.
If a rule was violated, the report
notes the error and gives the
corresponding rule text. A description
of the test method applied can be found
in the test specification. Customers can
obtain more detailed data on the test execution
and results, such as captured
waveforms or data, from the test laboratory
if needed for further analysis.

The BICT system offers vendors
the security of an impartial test of a
board’s conformity to bus standards.
Conformity is a necessary precondition
for the interoperability of boards sharing
a bus.

To test a product, a center needs information
only about the features and
parameters of the board, not design information
such as schematics, PROM, or
PLA data. Design information stays with
the originator.

The mainly automated test procedures
ensure cost- and time-effectiveness and
guarantee objective results. A basic aim
of the BICT project was to eliminate as
much human influence as possible. The
results depend only on the strengths and
weaknesses of the test system. We are
optimistic that any remaining weaknesses
will be eliminated as the test system
matures.


[Figure labels: test controller; unit under test (with 1-Mbyte memory).]



Message passing protocol test
The following errors were identified:

               Start             End
Rule number    Date     Time     Date     Time     Result
13.4-1         910522   083910   910522   083910   passed
13.5.1.2-4     910522   083910   910522   083911   passed
13.5.1.2-5     910522   083956   910522   083957   failed
*** buffer grant message received from UUT
13.5.1.2-7     910522   084002   910522   084003   passed

Details of failed tests

The following test cases failed:

Rule 13.5.1.2-5

If an agent is not able to receive a requested solicited message, it must send a
buffer reject message to the agent that has sent the buffer request message.


Figure 8. BICT report section.
Starting this year, the service will be available at the test
centers involved in the BICT project: the VDE Test and Certification
Institute in Offenbach, Germany; S. Seferiades and
Associates (SSA) in Athens; and Istituto Italiano del Marchio
di Qualità (IMQ) in Milan.
How I spent my Christmas vacation


My holiday tradition is to attack the big
projects that I’ve been putting off all
year. This year I plunged into the
Macintosh.

I have a Mac SE/30. I use it every day in my 
consulting work. I also use it for reviewing the 
Macintosh software that I receive in abundance. 
This is the source of the problem. 

In my consulting work I do simple word processing.
I occasionally transmit text files over
phone lines. My Mac SE/30’s original memory
size of 2 Mbytes was more than adequate for
these tasks.

I enjoy reviewing new software, since the process
forces me to learn to use powerful new
tools that my natural inertia would otherwise
keep me from exploring. Unfortunately, new
software grows by leaps and bounds, and I can’t
keep up with it unless my computer does the
same. Thus, my project for the holidays was to
upgrade my Macintosh to 8 Mbytes.

Memory upgrade 

Looked at objectively, a Macintosh memory
upgrade is a pretty small project. It takes less
than half an hour, but somehow I managed to
put it off for nearly four months after I had everything
I needed in hand. I probably would
never have done it at all if it had not been for
the idiot-proof package put together by the mail-order
company called Mac Warehouse.

I called Mac Warehouse’s toll-free number
(800) 255-6227 to order the 1-Mbyte single in-line
memory modules (SIMMs) required to upgrade
a Macintosh. As I write this, their advertised
price is $59 each. I needed eight of them to upgrade
from 2 Mbytes to 8 Mbytes. Each one replaces
one of the original eight 256-kilobyte
SIMMs. The Mac Warehouse people told me I
needed a simple toolkit, including a grounding
wrist strap, which they sold me for $9. They
included a video cassette demonstration of the
upgrade process. The entire package showed
up at my house the next day. Then it sat in my
office for more than three months.

The video is marvelous. I am about as
unmechanical as a graduate of an engineering
school could possibly be. Removing the SE/30’s
motherboard would have been beyond me with
only written instructions. A woman with a reassuring
manner demonstrated all of the simple
but hard-to-explain-in-print steps required to
upgrade any Mac’s memory. With exquisitely
manicured hands she deftly manipulated tools,
eased wire harnesses out of hard-to-reach sockets,
and snipped a resistor lead. She carried on a
running description of what she was doing,
calmly mentioning the inherent dangers at certain
points. My favorite was when she said, “If
you crack the video tube, it will implode.”

I carried out the entire task in about a half
hour. I sat on my living room rug with the
grounding strap on my left wrist, the VCR remote
control unit by my right hand, and the TV,
my Macintosh, and my tools in front of me. The
video runs 11 minutes, but I kept replaying key
parts until I was sure I understood what I was
doing. When I reassembled my Mac, everything
worked perfectly.

Operating system upgrade 

I would never change my operating system
software if I didn’t have to. Unfortunately, new
software invariably makes old operating systems
obsolete. When I loaded the latest version of
Mathematica onto my Macintosh, it immediately
announced that it wouldn’t work on the version
of the Macintosh operating system that I was
running. I could have put in a minor
upgrade from 6.03 to 6.07, but I decided
to switch to System 7, Apple’s
long-awaited major upgrade.

The Berkeley Macintosh Users’
Group (BMUG) is a wonderful resource
for anyone who uses Macintosh computers.
When I decided to upgrade to
System 7, I called them at (510) 549-BMUG.
I bought their package deal,
which includes the 10-disk upgrade set
from Apple and The Little System 7 Book
by Kay Yarborough Nelson (Peachpit
Press, Berkeley, 1991, 158 pp., $12.95),
all for $25. I did this about the same
time I bought the SIMMs, and everything
sat around my office for just about
as long.

System 7 is quite different from previous
Macintosh operating systems, but
by the time I sat down to use it I felt
right at home. I attribute this to the
quality of the upgrade package that
Apple put together, the excellence of
The Little System 7 Book, and the many
informative articles about System 7 in
the BMUG Newsletter.

Users have come to expect installation
programs to guide them through
the installation of large application
packages. While installation programs
have certainly improved considerably
over the last few years, Apple’s upgrade
software is a cut above any other that
I’ve seen. It is a model for others to
follow. Especially impressive were two
HyperCard stacks designed to familiarize
users with the new features of System 7.
The one dealing with the new
networking features stands out.
It simulates the behavior of
the system, allowing you to make menu
selections, click buttons, and so on, just
as you would if you were performing
the networking functions it is teaching
you.

While my move to System 7 was
easy, it also had its rough spots, all of
which hinge on compatibility. Because
the underlying software of System 7
differs substantially from previous versions,
many software packages designed
for previous versions will not
run. The Apple upgrade program begins
by printing a list of all of the applications
on your machine. It labels
each one as “fully compatible,” “mostly
compatible,” “must upgrade,” or “information
not available.” Unfortunately,
the last two categories predominated
on my machine. The only ones listed
as fully compatible were Word, Excel,
Mathematica, and MacDraw II.




The compatibility printout contains
helpful information on upgrading old
software. It lists your version number
and that of the first compatible version.
It gives the phone number of the manufacturer
of each package you need to
upgrade. In some cases it indicates that
the upgrade package actually includes
a compatible version. Unfortunately, it
appears to say that the upgrade package
includes a compatible version of
HyperCard, but this is only true if you
buy the System 7 Personal Upgrade Kit
from Apple. My BMUG package did
not contain HyperCard, and the
HyperCard version that came originally
with my SE/30 is incompatible with
System 7. So I had to scurry around to
obtain a working version.

While simple incompatibility with the
programming of System 7 is a problem
for most packages, some have an
even worse problem. System 7 makes
a large part of what they do obsolete.
The popular Suitcase II and Master Juggler
programs are prime examples.
Many of their capabilities for managing
fonts, sounds, and desk accessories
are no longer needed. System 7
handles sounds and fonts by allowing
you to drag their icons into or out of
the system file. It eliminates the distinction
between applications and desk
accessories altogether, since it now incorporates
the features of MultiFinder.
The apple menu simply lists the contents
of a folder in the System Folder. The
apple menu folder can contain any
application, document, or folder. Selecting
an item from the apple menu
causes the item to be opened.

Another important feature of System
7 makes front-end programs like Power
Station less important. System 7 allows
you to create an alias for any document,
application, or folder. The alias
consumes a tiny amount of disk space
and serves as a pointer to the actual
item. Thus, for example, you can leave
Excel or Word in its own folder but
place an alias for each in the apple
menu folder or the start-up folder. You
can make aliases for the apple menu
folder and the system file and leave
the aliases on the desktop. I made an
alias for the text file that contains this
column and placed it in the apple menu
folder. When I select “Column” from
the apple menu, the system opens this
text file and the corresponding word
processor.

One important Power Station function
that I haven’t figured out how to
do in System 7 is to associate an icon
with an application and a specific set
of documents to be opened with it. I
would certainly keep Power Station
under System 7 for its document-handling
capabilities alone, but unfortunately
my version caused the system
to crash when I tried to run it. The
Apple compatibility checker provided
no information about Power Station,
so this will take additional work on my
part. I encountered a number of other
small nuisances like that during the
upgrade, but on the whole I’m quite
happy.

As I write this column using Microsoft
Word running under System 7, I am
simultaneously running Mathematica,
HyperCard, and the Variable
Symbols Mathematica Help Stack (see
Micro Review, Aug. 1991); I even have
another 250 Kbytes left over in my new
8-Mbyte memory. I’m impressed. I
wonder how long this euphoria will
last.

Books 

The holidays are also a nice time to
sit by the fire reading, so I managed to
look at a number of good books. One
of the best of these (see next paragraph)
is a little old for a computer
book, and it does show its age in
places, but it is so good that it’s worth
reading anyway. I hope the publishers
bring out an updated version.

The Sachertorte Algorithm and
Other Antidotes to Computer Anxiety,
John Shore (Viking, New York,
1985, 286 pp., $16.95)

Shore states his purpose to be promotion
of a general understanding of
computers. As a reflective and witty old
hacker, he is well qualified to achieve
that purpose. It looks to me as though
he has done so, although I can’t look
at the book through the eyes of the
intended audience. All I can say is that
again and again he makes points that
I’d want to make to an interested and
intelligent beginner. It’s hard to communicate
the overall effect of this by
quoting isolated phrases, but here are
a few that appealed to me:

In general, jargon-filled error
messages are symptoms of a
basic problem, namely that the
designers and programmers of
many office and personal computer
systems have failed to
separate their own concerns
from those of the user.

If you buy a new car and
spend the next year having the
dealer fix things that didn’t
work right to begin with, you
don’t say that your car is being
maintained; you say you
bought a lemon. By this criterion,
most software products
are lemons.

While standing recently in a
cold shower, I had occasion to
think about replacing my hot
water heater. Because I had
been thinking about software
engineering before the water
turned cold, I was impressed
by the extent to which I could
choose a new water heater
without thinking about air conditioners,
telephones, windows,
or practically anything
else except the number of
people in the house and the
number of dollars in my bank
account.

The title of Shore’s book derives from
the extended example he gives of trying
to use his mother's recipe for Aunt
Martl’s Sachertorte. He illustrates the
process of stepwise refinement as he
moves from his mother’s original instructions
(prepare ingredients; bake
cake) to a more detailed sequence
of steps that are all within his own
more limited repertoire of cooking
operations.

Shore wants his readers to understand
that programming computers is
an exercise in careful and precise communication.
The programmer must
communicate with the machine, but
even more importantly, the programmer
must communicate with other
human beings. In this sense, programming
is a literary activity. When seen
this way, some of the problems of the
“software crisis” are easier to understand.
Writing is easy, but writing well
is hard.

Shore also wants us to look at programming
as mathematics and as architecture.
Here he laments the fact that
millions of people are acquiring personal
computers and programming
them with the languages and techniques
of the sixties. The concepts of
abstract specification and information
hiding that were introduced in the early
seventies are only beginning to be
widely used today. Proofs of correctness,
long recognized to be theoretically
possible and highly desirable, are
still no more than classroom exercises.

The tools of the writer, the mathematician,
and the architect are the keys
to managing the complexity of large
software packages. The failure to manage
complexity, in Shore’s view, is the
cause of our current software crisis.
Shore does a good job of explaining
and illustrating the problems of complexity
and the failure of our tools in
terms that should be accessible to readers
of all backgrounds.

If you see this book in your local 
bookstore, buy a copy to give to a 
nontechnical friend. You might enjoy 
reading it yourself first. 

Mathematica: A Practical Approach,
Nancy Blachman (Prentice
Hall, Englewood Cliffs, N.J., 1992, 380
pp., $30)

I wanted to install 8 Mbytes of
memory in my Macintosh so I could
run Mathematica comfortably. Unfortunately,
inadequate memory is not the
only obstacle to using Mathematica.
Mathematica is a large program with
many capabilities, and before
Blachman’s book we had no adequate
tutorial introduction to it.
Stephen Wolfram, the creator of
Mathematica, wrote a book that is supplied
with the program, but as
Blachman points out, learning
Mathematica from that book is like
learning English from a dictionary.

In my August 1991 column, I discussed
some of the work Nancy
Blachman and her company, Variable
Symbols, have done to teach people
to use Mathematica. She based the current
book on undergraduate courses
she taught at Stanford University. It
draws on her extensive experience with
the difficulties people encounter in
learning Mathematica.

I’m impressed by the quality of this
book. Blachman provided Prentice Hall
with camera-ready copy, so she can
take credit for its attractive appearance
and its excellent editing. Mathematica
can also take some credit, since the
book began as a Mathematica
notebook, and many of the illustrations
were originally done using Mathematica.

Learning Mathematica is not my top 
priority, so working my way through 
this book will be a background project 
for a while. In my next column I'll tell 
you how it’s going. So far, I like 
Blachman’s approach. 

The Parents Guide to Educational
Software, Marion Blank and Laura
Berlin (Microsoft Press, Redmond,
Wash., 1991, 424 pp., $14.95)

The authors of this book are an educational
psychologist and a developmental
psychologist. Their intended
readers are parents who want to give
their children the advantages of computer-assisted
education but don’t
know what to do next. As the authors
point out,

Currently you can choose from
over 10,000 programs, covering
almost every subject
imaginable. And, as always, excellence
is rare. If you’re like
most parents, you probably
don't see this array of options
as a pool of resources but as a
mire of confusion.

The authors have selected more than
200 programs for careful rating. They
have applied several criteria to this selection,
and some of these criteria differ
from those used in selecting
programs to be used in a school setting.
Basically, all of the programs in
the book are of high quality, attractive
to children of the intended age group,
sufficiently varied that children won't
tire of them immediately, and inexpensive
enough for many families to consider
buying them. Where possible the
authors tried to select programs that
are helpful to the 10-15 percent of the
students who have learning difficulties
in specific learning areas.

Most of the book is devoted to thorough
reviews of the selected programs,
but the authors begin with helpful introductory
material. They explain how
computers can help with learning. Then
they review the major outlines of the
elementary school curriculum, focusing
on what skills are required by each
area of study. Finally, they talk about
how to set up an educational computer
center at home and how to help your
child use the programs that you
acquire.

The reviews all follow a common
format. The authors identify each program
as being for a specific age range
and school grade range. Then they
summarize the hardware requirements,
the abilities a child must have to use
the program, and the curriculum areas
that the program addresses. A detailed
discussion of the program itself follows
these short summaries.

I have no experience with any of
the programs the authors have included,
so I can’t judge their reviews.
However, the reviews that I read addressed
issues that I'd want to consider
in buying educational software. One
interesting example was a $14.95 program
for children in the two-to-four age
range. The child would not be able to
start the program because of a complicated
copy protection scheme, so the
parent would always have to be there
to start it. The authors emphasize this
point without passing judgement. I
admire their restraint.

I think this book is a really good investment
for anyone considering buying
educational software.

Annual phone list update 

My year end would not be complete
without the annual overhaul of my
phone list. This year I finally automated
the process.

Address Book Plus (Power Up Software
Corp., 2929 Campus Drive, San
Mateo, CA 94403, (415) 345-5900,
$99.95)


This program provides what most
people need to handle their little black
books. It lets you build a small database,
define selection and sorting criteria,
and generate reports in a variety
of useful formats. These features are
mostly built in, so that users don’t need
to use query or report generation
languages.

The program supports printing formats
corresponding to the popular
pocket and desk appointment books.
It also has a special Instabook format,
which lets you print a stack of two-sided
pages and turn them into a
pocket-size address book, simply by
cutting, folding, and stapling. I tried it,
and it really works.

The program offers an integrated
telephone dialing facility. It seems to
have all of the flexibility necessary to
deal with prefixes, area codes, choice
of long-distance carrier, international
dialing, modem control, and so on.

Power Up has gone to great lengths
to make this program fit into your way
of doing things. It can import or export
files in formats that you define,
and it has built-in formats allowing it
to import from HyperCard and to export
to personal organizers from Casio
and Sharp. It also works with Power
Up’s mail merge program (Letter Writer
Plus).

It’s taken me a long time to get 
around to using a program like this, 
because no program comes close to 
doing all the little things I do by hand, 
but this one is good enough. What it 
does for me outweighs the things I’ll 
have to give up. 
Understanding the Acronym Tower of Babel


I am often asked what a particular acronym
means, or where it came from. After
reflecting on these questions, I realized
that they might be the meat for an interesting
column. After all, we use these “shortened” words
to convey large amounts of information within a
standard or other technical document.

Interestingly, many of the acronyms and definitions
used today came about because of the
telegraph and Teletype era. Since telegraphers
had to tap in each letter of a word sent, they
naturally wanted to use the smallest words possible.
Consequently, a unique shorthand was created.
This shorthand carried over to Teletype
users, partly due to the slowness of the system
and the difficulty of typing on the keyboard. Any
of you who remember a KSR 35 will understand
what I mean.

Interestingly, World War II and the Korean
conflict also contributed to the acronym pool. The
idea was to send as much information as fast as
possible so the enemy wouldn’t have a chance to
locate the signal or decipher it.

Another interesting note concerns the contributions
industry has made to the acronym Tower
of Babel. The airline industry, in days before sophisticated
reservation systems, used a form of
shorthand called Fast Talk. This method let an
agent in Chicago call an agent in Des Moines to
reserve a seat through to Los Angeles.

Even with the advent of computer systems and
better communications than Teletype, the airline
industry developed a worldwide standard abbreviation
system called SIPP (Standard Information
Processing Protocols) codes. This system of codes
allowed a reservations agent at United Airlines to
send a great deal of easily understood information
about a passenger to another carrier. For
example, C-19 meant that the passenger required
special handling. Other codes indicated special
food or an unpleasant passenger.

0272-1732/92/0200-0069$03.00 © 1992 IEEE

Similarly, the airlines also had a system for
identifying lost baggage. For example, 02 Blue
indicated a blue, hard-sided Samsonite bag and
03 denoted an American Tourister bag. Type 20,
on the other hand, was a folding garment bag. I
was one of the developers of United Airlines’ Total
Apollo Baggage System (TABS) that made use
of this information to automatically search for lost
bags—a system that ultimately saved United several
million dollars a year in mishandled baggage.

Of course, NASA (the US National Aeronautics
and Space Administration) probably gets the prize
for developing acronyms. It isn’t unusual to pick
up a NASA-developed document and find that it
can't be read without the help of an acronym
glossary close at hand. NASA’s rationale is that so
much information has to be conveyed that abbreviations,
acronyms, and initialisms (AA&I) enhance
the readability of the document.

Because of limited space, I’ve taken two approaches
in my acronym/definition list beginning
on the next page. I’ve chosen some interesting
acronyms and expounded upon them, and others
I've just listed with their meaning. I could write
volumes on this area alone or include multiple
meanings, but I won’t. I covered lots of ground
with this column. You can probably guess, however,
that I only touched the surface. If you have
a favorite acronym that you would like to share
and can tell a short (no more than two-paragraph)
story, I’ll include it in forthcoming Micro Standards
columns.

Finally, some parting thoughts: Didja [sic] know
that Soroc, the now-defunct spin-off terminal
manufacturer from Lear Siegler, used an anagram
for its name? Take a close look at Soroc, and you’ll

continued on p. 72 
Glossary

Ack: acknowledgment, an old
communications term that was used
with Teletype systems. The telegrapher
on the receiving end would indicate
that a connection was made,
or a message received, either by
sending AK (later ACK) or their initials.

ADC: analog-to-digital conversion.

AIIM: Association for Information
and Image Management.

ANSI: American National Standards
Institute, the governing body
for the management and creation of
standards (see IEEE, MSC, SPARC,
and TCMM).

AT: Advanced Technology, an
IBM-fostered term describing the basis
of the PC bus architecture. This
architecture built on the 8-bit and 16-bit
PC and XT (Extended Technology)
buses of IBM’s earlier machines.

Ata: AT attachment.

BER: bit error rate, an important
metric when referring to storage or
communications devices, which
gives the number of bits at fault during
various types of measures of the
device.

BSR: Board of Standards Review.

Bsy: busy. Again, this is a term
from telegraphy days. When queried,
the downline telegrapher would respond
with a busy reply, requesting
a hold on traffic. Later with Teletypes,
a busy signal was sent out much like
the busy signal on your telephone
being high, thus denoting busy, or
setting of a “busy” bit in a UART, for
example.

CAM: common access method/
content-addressable memory.

CBEMA: Computer and Business
Equipment Manufacturers Association.

CISC: complex instruction-set
computer, a class of processor that
has a large number of instructions to
choose from. An example is the Intel
80386 or Motorola 68040 (see RISC).

CKD: count-key-data.

CR: connection resource.

CRC: cyclic redundancy check.

CS: continuous servo.

DADI: Directly Addressable Device 
Interface. 

DASD: direct access storage device, a 
disk or other secondary storage device 
that permits access to a specific sector or 
block of data without first requiring the 
reading of the blocks that precede it. 

DMA: direct memory access, transfer
of data between the memory and a peripheral
(or another memory) without intervention
of the CPU (besides DMA
controller initialization).

DOTS: digital optical tape system. 

DTA: dedicated test article. Various
test units may be used for various purposes
in a test environment. A DTA is
specifically dedicated to test purposes
and as such may have special test ports
and special power inputs, and the cabinets
may or may not exist.

DUT: device under test. This refers to 
the actual hardware that is attached to 
test equipment (see UUT). 

ECMA: European Computer Manufacturers
Association.

EIA: Electronic Industries Association.

EISA: Extended Industry Standard Architecture.

Escon: Enterprise Systems Connection.

ESD: electrostatic discharge. 

ESDI: Enhanced Small Disk Interface.
This is the analog to SCSI. This interface,
which was developed by Maxtor, provides
a fast channel for disk drives. Originally,
the “D” meant device, but the
developers changed their aspirations
and decided that a fast interface for disks
was good enough. IBM did choose ESDI
for early PS/2 models but is now moving
exclusively to SCSI. This is one of
the many interfaces that I. Dal Allan of
ENDL Consulting, Saratoga, Calif., can
take credit for.


FAT: file allocation table, a table
that contains mappings of the physical
locations of all of the clusters in
all files on disk storage.

FBA: fixed-block architecture. 

FC: fiber channel. This is technically
an initialism and is a bad
choice—writing out fiber channel
makes more sense than using the
initials.

fci: flux changes per inch. 

FDDI: Fiber Distributed Data Interface,
protocol description for optical
networks and copper coaxial cable
networks.

FEU: functional equivalent unit, a
test unit equivalent to the actual finished
device. As such, it has the same
appearance as the finished article and
operates in accordance with the functional
requirements.

FFS: flat-file system. 

FOM: figure of merit, another
poorly thought-out use of acronyms.
A figure of merit is a number that is
determined from some weighting
system that has preestablished
boundaries.

fps: frames per second. 

FPT: Forced Perfect Termination.
This is a nifty trick developed by IBM
to create a termination scheme that
follows the changes in the impedance
of a line. Theoretically, it implies an
infinite length without attenuation of
the signal. However, physics does get
in the way, and the signal will die at
some point.

FRAM: ferroelectric memory.

Gbsi: gigabits per square inch. 

HBA: host bus adapter. (This is
SCSI talk.)

HIPPI: High-Performance Parallel
Interface.

HIPPI-FP: Framing protocol.

HIPPI-LE: IEEE Std 802.2 link encapsulation.

HIPPI-PH: physical layer.

HSC: hierarchical storage controller.
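
The FAT entry above describes a table that maps where each file's clusters sit on disk. A minimal Python sketch of that idea follows, assuming a toy table in which each slot holds the number of the next cluster in the same file. The names cluster_chain, EOF, and FREE, the reserved slots, and the sample file COLUMN.TXT are illustrative assumptions, not the layout of any real file system.

```python
# Toy allocation table (an assumption for illustration): the list index is a
# cluster number and the value is the next cluster of the same file.
EOF = -1    # marks the last cluster of a file
FREE = 0    # marks an unused cluster


def cluster_chain(fat, first_cluster):
    """Return, in order, the clusters that make up one file."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != EOF:
        # A loop or a link into free space means the table is damaged.
        if cluster in seen or fat[cluster] == FREE:
            raise ValueError("corrupt allocation table")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


# Slots 0 and 1 are treated as reserved in this toy layout.
fat = [FREE, FREE, 3, 5, FREE, 6, EOF, FREE]
directory = {"COLUMN.TXT": 2}   # hypothetical file name -> first cluster

print(cluster_chain(fat, directory["COLUMN.TXT"]))   # prints [2, 3, 5, 6]
```

The directory records only the first cluster; everything else about the file's physical location comes from walking the table, which is why a damaged table can make otherwise intact data unreachable.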


Glossary (continued) 

IDC: insulation displacement connectors.

IEC: International Electrotechnical
Commission.

IEEE: Institute of Electrical and
Electronics Engineers.

IIST: Institute for Information Storage
Technology.

IOPS: input/output (requests) per
second.

IPI: Intelligent Peripheral Interface,
an interconnection standard that
grew out of the Intelligent System
Interface developed by ISS Sperry
Univac. This interface is designed for
large systems that have very high
speed I/O channels.

JISC: Japan Industrial Standards
Commission.

JTC1: Joint Technical Committee 1.

Lun: logical unit.

MCAV: modified constant angular 
velocity, a method used on compact 
disks and some read/write optical 
disks. 

Mflops: millions of floating-point
operations per second, a metric used
to benchmark the overall processing
capability of a microprocessor and
math coprocessor.

MIG: metal-in-gap, a specialized
read/write transducer used in high-performance
disk drives.

MIPS: millions of instructions per
second, a measure of a processor’s
horsepower. It is usually compared
with a known system such as a Digital
Equipment Corp. VAX-11/780 that
has a MIPS rating of 1.

MO: magneto-optical, a technique
of combining magnetic recording
with optical recording to produce a
high-capacity read/write device.
Geoffrey Bate, a Santa Clara University
fellow, pioneered this technique
while at Verbatim Corp.

MR: magneto-resistive, a single-pole
read/write transducer used in
high-performance and high-bit-density
disk drives.

MSC: Microcomputer Standards Committee,
the governing body for bus
standards.

MTBF: mean time between failures.

MTBS: mean time between stops. 
This has nothing to do with the rapid 
transit system in your town; rather, it is 
a measure used with rotating memory 
systems. 

MTDL: mean time to data loss. This
is my favorite acronym. Apparently, it
describes what we all fear most (by the
way, I saved my work at this point).
That’s the loss of all the work we have
done up to the deadline because of a
power failure, because your cubicle mate
spills coffee down the grillwork of your
terminal, or because the gnomes who
control these things have decided it’s
your turn in the barrel.

MTSR: mean time to service repair.

MTTR: mean time to repair. 

NIC: newly industrialized country. 
This just shows how silly we get with 
acronyms, but it is actually used. 

NRE: nonrecurring engineering.

NRZ: nonreturn to zero, a recording
method that uses the zero crossing as a
reference point.

NSIC: National Storage Industry Consortium.

NWI: new work items.

OS, O/S: operating system.

OSF: Open Software Foundation.

PCD: printed circuit disk. 

POH: power-on hours. 

PRML: partial response maximum 
likelihood. 

Prej: port reject.

RISC: reduced instruction-set computer,
a fast processor that has fewer
instructions than a CISC and is defined
as performing a register-to-memory
move in one t (machine-cycle time).

QIC: quarter-inch cartridge; also the
name of the tape committee, chaired
by Raymond Freeman, Freeman Assoc.,
Santa Barbara, Calif., which developed
the quarter-inch format and recording
standards.


Rej: reject. Most people believe the
general use of this term comes from
the reject switches on phonograph
record players, which were marked “Rej”
on the top of the knob.

RLL: run-length limited, which defines
a code combining 1s and 0s that
is a mathematical permutation allowing
more information to be encoded
on one transition. Notably, RLL is
used in data recording and transmission.

R/W: read/write.

SCSI: Small Computer System Interface,
one of the most important interfaces
in the industry. SCSI grew out
of the Shugart Associates System Interface
(SASI). The now-defunct
Shugart took the idea to ANSI under
the guidance of I. Dal Allan (the father
of modern-day interfaces), and the result
was SCSI-I. Currently, SCSI-II—a
robust 16-bit version with 32-bit capability—is
expected to be deemed an
ANSI standard soon.

SD3: project approval request
(ANSI).

SPARC: Standards Planning and
Requirements Committee, the group
you address when preparing a new
standard; also a microprocessor architecture.

SPOOL: simultaneous peripheral
operations on line, in which a high-speed
device like a disk is interposed
between a running program and a
low-speed device such as a printer.

SSD: solid-state disk.

SWG: Special working group. 

SSWG: Specific subject working 
group. 

TAG: Technical Advisory Group. 

TC: technical committee. 

TCMM: Technical Committee on
Microcomputers and Microprocessors.

TFH: thin-film head, a special read/write
transducer that is made using
semiconductor techniques. This type
of head permits smaller geometries,
thus more tracks per inch and greater
density of storage.

TFM: thin-film media, a special recording
medium developed by using a
sputtering process to lay down the
thinnest possible layer of recording
material to improve permeability and
susceptibility to the recording
process. This technique is used for
optical recording, especially thermal-magneto
recording (TMR), which uses both
optical and magnetic methods.

tpi: tracks per inch.

TPS: transaction processing system. 

UART: Universal Asynchronous Receiver
Transmitter.

ULP: Upper Layer Protocol. You can 
tell this came from technologists who 
deal with systems and storage devices. 
Communications people have their own 
jargon, and they would tend to point out 


the layer in the seven-layer OSI 
model. 

UUT: unit under test. (See DUT.) 
UUT is similar to a DUT depending on 
whom you talk to. Sometimes, UUT 
refers to software under test, while 
DUT refers primarily to hardware. 

WORM: write-once, read-many, an 
optical device that laid the foundation 
for the optical systems being used 
today. 


notice that it spells “Coors”—even their 
logo was the top of a beer can. 

Remember NBI, the fast-rising word 
processor company of the seventies 
and early eighties? NBI stood for “noth¬ 
ing but initials.” 

And, one of the more interesting acronym
histories belongs to EDN magazine.
Most people erroneously think
EDN stands for “Electronic Design
News.” About 40 years ago when EDN

continued from p. 6
prepare any response to your first.

It is the policy of Bit Bucket Consulting
Services, Ltd. not to have its
personnel enter into confidentiality
relationships regarding pending litigation
until the engineer and the company
are sure they can appropriately
support the position of the party for
which the consulting/expert witness
services are to be performed. I tried to
convey this during our telephone
conference on November 14, and
thought that I did. But apparently I was
not successful in properly communicating
that position. I regret any confusion
that may have been caused by
failure of communication.

Accordingly, I have not read the 
materials enclosed in your letters and 
will not do so unless and until you ad¬ 
vise me that I can do so without creat¬ 
ing any mutual problems. I will study 
the two SIMM patents and any other 


started, the name meant “Electrical De¬ 
sign News.” However, the magazine 
became generically known as EDN and 
the publishers decided that would be its
official name. But H. Victor Drumm, the
publisher during the seventies and
eighties, felt EDN stood for “everything
a designer needs,” which shows that an 
acronym never outshines brilliance. 

Want to send me your acronym 
story? Mail it to me at the address shown 


documents of public record that you feel
I can appropriately examine in the light
of the foregoing policy of Bit Bucket
Consulting Services, Ltd. Then I will attempt
to reach a preliminary decision
whether I feel that the patents are valid
and whether I can otherwise support the
basis of your claim against Toshiba and
NEC as to SIMMs infringing your two
patents. I will check with our company
records to make sure no conflict exists
and will promptly get back to you.

If there is any problem with this, 
please advise promptly. Thank you for 
your interest in having us assist you. 

Sincerely yours,

Bit Bucket Consulting Services, Ltd. 

By John Q. Kludge 

This is not perfect, but probably will 
work. For one thing, because you can¬ 
not claim not to have seen the enclo¬ 
sures, it may be argued that you must 
have read them. I don't know what you 


earlier; I’ll be happy to receive it.

can do about this, unless a lawyer or
clean-room service reads and screens 
all your mail for you. I welcome sug¬ 
gested improvements from readers. In 
any event, those of you who want to 
consult or serve as expert witnesses 
now have an interesting new problem 
to worry about.

Firmware standards


[The P1275 working group is developing a proposed
standard for boot firmware based on the
machine-independent Open Boot firmware. 
Mitch Bradley discusses Open Boot's design and 
benefits of standardization. 

I invite readers to send information on a tool 
or method that solves problems, for consideration 
in future columns. -C.W.] 


Mitch Bradley 
Sun Microsystems 


Firmware is the ROM-based software that
controls a computer between the time it
is turned on and the time the primary
operating system takes control of the machine.
Firmware's responsibilities include testing and
initializing the hardware, determining the hardware
configuration, loading (or booting) the
operating system, and providing interactive de¬ 
bugging facilities in case of faulty hardware or 
software. 

Historically, firmware designs have been pro¬ 
prietary and often specific to a particular bus or 
instruction set architecture (ISA). This need not 
be the case. Firmware can be designed to be
machine-independent and easily portable to dif¬ 
ferent hardware. There is a strong analogy with 
operating systems in this respect. Prior to the 
advent of the portable Unix operating system in 
the mid-seventies, the prevailing wisdom was 
that operating systems must be heavily tuned to 
a particular computer system design and thus 
effectively proprietary to the vendor of that 
system.

Standardization

Standardizing firmware would offer several 
advantages to designers and users, including 

• Consistency across different systems. It
is easier to learn one firmware command
set and use it across different systems from 
different vendors than to have to learn a 
different set of commands for every differ¬ 
ent system. 

• Avoiding duplication of effort. Without 
a standard, every computer manufacturer 
must reinvent the firmware wheel. This ef¬ 
fort is largely wasted. The availability of stan¬ 
dard firmware offers manufacturers the 
option of buying proven firmware off the 
shelf, rather than designing and building it 
from scratch. Porting is almost always 
cheaper than writing from scratch. The ex¬ 
istence of a standard also will raise the level 
at which value is added, allowing improve¬ 
ments to be made on top of a proven base, 
rather than continually redesigning the base. 

• Reducing design cycle times. Designers 
would have one less thing to build for each 
new system architecture if standardized firm¬ 
ware is available. 

• Good, not half-hearted, firmware. Since 
firmware is not visible to the average user 
and is rarely cited in marketing literature or 
measured in benchmarks or reviews, it is 
often treated as an afterthought or a neces¬ 
sary evil. Consequently, firmware design has 
not received the same level of attention as 
other, more visible software components. 
The requirements imposed on firmware by 
the desire to standardize it force us to take 
a hard look at many issues that might otherwise
be swept under the rug.

Among the many firmware solutions
available on different machines, Open
Boot is the only one that has been proposed
as a multivendor standard.

Open Boot advantages 

We designed Open Boot firmware
with open systems and standards in
mind. We intended from the outset to
port it to many different machines with
widely varying bus structures, ISAs, and
configurations.

Plug-in drivers. A key Open Boot 
feature is support for self-identifying 
devices. Consider a computer with an 
open expansion bus, such as VMEbus 
or SBus. An independent board ven¬ 
dor (that is, not a system manufacturer) 
of a card that plugs into the bus wants 
the system to recognize and use that 
card. In an operating system environ¬ 
ment, this is easily accomplished. The 
board vendor supplies a driver on a 
diskette that can then be loaded onto 
a hard disk or installed into the oper¬ 
ating system. 

It is more difficult to support third- 
party devices in the firmware environ¬ 
ment because firmware operates before 
the system is ready to read the disk. 
Since it is difficult to merge third-party
drivers into existing system ROMs, it is
better to store a driver in a ROM on
the card for the plug-in device to which
it applies. Others have taken this approach,
but most existing firmware systems
store the driver in ISA-dependent
machine language binary code, and
thus it only works on computer systems
from a particular vendor.

Open Boot also uses the plug-in
driver technique. But instead of storing
those drivers in machine language,
Open Boot uses FCode. FCode is a
machine-independent, byte-coded intermediate
language for the Forth programming
language. It is based on a
stack-oriented virtual machine that may
be easily and efficiently implemented
on any computer. FCode drivers are
incrementally compiled into system
RAM for later execution.
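
To make the idea of a byte-coded, stack-oriented virtual machine concrete, here is a minimal sketch in Python. It is not FCode: the opcode numbers and the tiny instruction set are invented for illustration, and the sketch interprets the bytes directly rather than compiling them into RAM as Open Boot does. What it shows is why such a machine ports easily: the whole interpreter is a short loop over bytes.

```python
# Toy byte-coded stack machine. The opcode values below are invented for
# this sketch; they are NOT the real FCode token assignments.
PUSH, ADD, MUL, DUP, SWAP, END = 0x10, 0x1E, 0x20, 0x47, 0x49, 0x00


def run(program: bytes) -> list:
    """Interpret a byte-coded program and return the final data stack."""
    stack = []
    pc = 0  # index of the next byte to fetch
    while pc < len(program):
        op = program[pc]
        pc += 1
        if op == PUSH:
            # The byte following PUSH is a literal operand.
            stack.append(program[pc])
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == DUP:
            stack.append(stack[-1])
        elif op == SWAP:
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == END:
            break
        else:
            raise ValueError(f"unknown opcode {op:#04x} at offset {pc - 1}")
    return stack


# Same computation as the Forth phrase "3 4 + dup *", that is, (3 + 4) squared.
demo = bytes([PUSH, 3, PUSH, 4, ADD, DUP, MUL, END])
print(run(demo))
```

Because the program is just a sequence of bytes with no reference to any processor's instruction set, the same driver image could in principle be run by any host that supplies this loop.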

In addition to its use for firmware
device drivers, FCode also provides a
descriptive capability. Plug-in device
cards use it to report their characteristics
to the firmware and system software.
Such characteristics may include
the device name, model, revision level,
register locations, interrupt levels, supported
features, and any other identification
information that makes sense for
the particular device. System software
may use this information to automatically
configure itself for correct operation
with particular devices.

Interactive debuggers. Open Boot
uses the same runtime system that executes
FCode drivers as the basis for
an interactive Forth language interpreter.
This interpreter can be used as
a programmable debugger, allowing
developers, users, and service personnel
to isolate system problems in the
event of a failure.

Flexibility. We designed Open Boot
for adaptability. Its notation and structure
for naming particular devices is
based on a hierarchical device tree that
mimics the bus configuration and
physical addressing of the machine on
which it is implemented. This structure
applies equally well to simple, single-bus
desktop machines and to back-room
servers with multiple processors
and complicated hierarchies of interconnected
buses. We designed the
name space for individual device
names so that allocating names requires
no central authority. Companies can
design their products without appealing
to a master name arbiter.

The Open Boot command language 
is open-ended. In addition to the stan¬ 
dard commands that are present on all 
implementations, an arbitrary number 
of new commands may be added at 
any time, even by the user. Such additional
commands may provide access
to system-specific features or may simply
be customizations for the needs and
tastes of individual users. 

Maintainability. Field ROM upgrades
can be expensive. Open Boot
provides a self-patching facility that
allows many types of firmware bugs
to be fixed without changing the system
ROM. The same facility can be used
to add firmware capabilities
to systems in the field, without changing
the ROMs.

Towards standardization 

Sun Microsystems began developing
Open Boot in 1987. We introduced
version 1 with our Sparcstation 1 machines.
Version 2, introduced with
Sparcstation 2, corrected a number of
deficiencies (we learned from our mistakes)
and added several new features.
We now use Open Boot on all of our 
current machines; approximately 
500,000 units in the field employ it. 

Sun has licensed its Open Boot
implementation to several other computer
manufacturers. Force Computers,
a leading supplier of VMEbus products,
intends to use Open Boot on future
products across several processor families
and buses. Sun Microsystems offers
the source code for an
implementation of Open Boot under a
licensing arrangement.

Working group. The IEEE P1275
Open Boot Working Group is developing
a firmware standard based on
Open Boot. The group meets monthly
at various locations. Membership is
open to all interested parties. For more
information, contact the author (see
box).

Several IEEE bus standards are making
provisions for Open Boot. The
Futurebus+ standard includes a mechanism
for using FCode ROMs for self-identification.
The VME-D draft document
mentions a similar mechanism.
SBus, under consideration for IEEE
standardization, uses FCode.

For a copy of the P1275 draft speci¬ 
fication contact the working group. The 
specification may be used and imple¬ 
mented without licenses or royalties. 

Mitch Bradley is a senior staff engi¬ 
neer at Sun Microsystems. He has de¬ 
signed analog and digital hardware, 
written Unix device drivers and other
software, and been a system trouble¬ 
shooter. Most recently he worked on 
the design, implementation, and pro¬ 
motion of the Open Boot firmware. He 
owns Bradley Forthware, a small com¬ 
pany specializing in Forth implemen¬ 
tations for various computers. 

Bradley earned a BE in electrical en¬ 
gineering, computer science, and math 
from Vanderbilt University and an MS 
in electrical engineering from Stanford 
University. He also studied for a year 
at Cambridge University on a Churchill 
Scholarship. He is a member of the 
IEEE Computer Society, the Forth Interest
Group, and the ANSI Forth Standards
Team.

The limits of chip density

Ware Myers, Contributing Editor 

Semiconductor density promises to 
increase for many decades to come, 
James D. Meindl of Rensselaer Poly¬ 
technic Institute predicted in an invited 
lecture to Supercomputing 91 in Albuquerque
last November (J.D. Meindl,
“Gigascale Integration (GSI) Technol¬ 
ogy,” Proc. Supercomputing 91, IEEE 
Computer Society Press, Los Alamitos, 
Calif., pp. 534-538). His analysis points 
to what he calls “gigascale integration,” 
or one billion transistors on a chip, by 
about the year 2000 with continued 
progress for some 30 years into the new 
century. 

Meindl analyzed a hierarchy of lim¬ 
its on transistor density: fundamental 
limits set by the laws of physics; mate¬ 
rial limits set by the physical structure 
and chemical composition of the likely 
materials, silicon and gallium arsenide; 
device limits such as, in the case of sili¬ 
con MOSFET devices, channel length, 
gate oxide thickness, and others; cir¬ 
cuit limits such as the constraints on 
switching logic circuits; and system limits
such as clock skew.

In addition, he considered the prac¬ 
tical limits influenced by manufactur¬ 
ing technology and economics. Meindl 
asked, “How many components/tran¬ 
sistors can we expect to fabricate in a 
single silicon chip that will prove to be 
useful, i.e., be economically viable, at 
some designated future time?” He quantified
the practical limits in terms of
three macrovariables: minimum feature
size, the square root of the die area,
and the packing efficiency, that is, the
number of transistors or components
per minimum feature area.
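
Taken at face value, these three definitions combine into a simple estimate: if the die edge is D, the minimum feature size is F, and the packing efficiency PE is the number of transistors per minimum feature area (F squared), then the transistor count is roughly PE times (D/F) squared. The Python sketch below works through that arithmetic. The function name and the sample inputs are assumptions chosen for illustration; they are not figures from Meindl's lecture.

```python
def transistors_per_chip(feature_size_um, die_edge_mm, packing_efficiency):
    """Estimate transistor count as PE * (D / F)**2.

    feature_size_um    -- minimum feature size F, in microns
    die_edge_mm        -- square root of the die area D, in millimeters
    packing_efficiency -- transistors per minimum feature area (F squared)
    """
    die_edge_um = die_edge_mm * 1000.0          # convert mm to microns
    feature_areas = (die_edge_um / feature_size_um) ** 2
    return packing_efficiency * feature_areas


# Hypothetical inputs: 0.25-micron features, a 40-mm die edge, and one
# transistor for every 25 minimum feature areas.
estimate = transistors_per_chip(0.25, 40.0, 1.0 / 25.0)
print(f"{estimate:.3g} transistors")
```

The formula makes the trade-offs visible: halving the feature size or doubling the die edge each quadruples the count, while packing efficiency scales it linearly.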

From principles of physics such as 
the Heisenberg uncertainty principle, 
Meindl derived two fundamental lim¬ 
its which he plotted on a log-log 
power-time delay field. One side of 
these curves constitutes an impossible 
region, an area where the power or 
the time delay is less than that required 
to perform a switching operation.

In a similar manner he derived vari¬ 
ous other limits. He plotted limits on 
switching operations on the power¬ 
time delay field. Limits on transmission 
operations were plotted on a field of 
interconnection path length versus the 
corresponding signal propagation or 
response time. These lines gradually 
boxed in areas on the fields where op¬ 
erations are feasible. 

Along the way he discovered that
“on the basis of bulk material limits
per se, silicon is not inferior to gallium
arsenide as a material for gigascale integration.”
That is, as switching circuits
get closer to the limits he hypothesizes,
silicon will reassert itself over gallium
arsenide’s present electron-mobility
advantage.

After considering a number of
metrics and limits, Meindl felt that one
figure of merit, in particular, was singularly
instructive. The chip performance
index, as he calls it, is the
number of transistors on a chip, divided
by the power-delay product. By
power, he means the average power
consumption during a binary switching
transition of a logic gate. By delay,
he refers to the corresponding transition
(or delay) time. Thus, as the number
of transistors increases, the chip
performance index also increases. And,
as the power and/or delay time (in the
denominator of the equation) becomes
smaller, the chip performance index
further increases.
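
The definition translates directly into a one-line calculation, sketched here in Python. The function name, the units, and the sample values are assumptions for illustration only; none of the numbers come from Meindl's analysis.

```python
def chip_performance_index(transistors, switching_power_w, delay_s):
    """Transistor count divided by the power-delay product.

    switching_power_w -- average power during a gate's switching transition, in watts
    delay_s           -- the corresponding transition (delay) time, in seconds
    """
    if switching_power_w <= 0 or delay_s <= 0:
        raise ValueError("power and delay must be positive")
    return transistors / (switching_power_w * delay_s)


# Hypothetical chip: one million transistors, 1 mW per switching gate, 1 ns delay.
base = chip_performance_index(1e6, 1e-3, 1e-9)

# Doubling the transistor count doubles the index; halving the delay does too.
more_transistors = chip_performance_index(2e6, 1e-3, 1e-9)
faster_gates = chip_performance_index(1e6, 1e-3, 0.5e-9)
print(more_transistors / base, faster_gates / base)
```

Since the power-delay product is the energy of one switching transition, the index can also be read as transistors per joule of switching energy.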

This index increased by about
from 1960 through 1990, Meindl calculated.
He projects it to increase by
another factor of 10*’ from 1990 through
2020.

Tiny lasers 

Researchers at AT&T Bell Laborato¬ 
ries have made what they believe is 
the world’s smallest semiconductor la¬ 
ser: about 5 microns in diameter. Seen 
through a scanning electron micro¬ 
scope, the lasers look like microscopic 
thumbtacks with the head of each tack 
400 atoms thick. 

The lasers operate in what is called 
a “whispering gallery” mode, so named 
after the sound effect noted in large 
cathedrals, where a whisper along the 
wall can be heard all along the inside 
perimeter. Like these whispers, pho¬ 
tons travel with low losses around the 
edge of the laser. The lasers can be 
used as surface- or side-emitting de¬ 
vices. Each semiconductor disk laser 
is made of one or more layers of in¬ 
dium gallium arsenide sandwiched 
between two layers of indium gallium 
arsenide phosphide. 



AT&T's micro disk laser 


Nanotechnology 

As technology becomes feasible on 
smaller and smaller scales, scientists 
continue to explore organic and inorganic
paths in search of building blocks
for what may one day be computers 
that function at the molecular or atomic 
levels. 

Natural nanocircuits. One bio¬ 
physicist is studying proteins that con¬ 
duct one-electron currents to find a 
model for protein-based nanocircuits— 
circuits made of organic materials 1,000 
times smaller than those used today. 

Jose Nelson Onuchic, an assistant 
professor of physics at the University 
of California, San Diego, believes it may 
be possible to design synthetic proteins 
that modulate minute currents, based 
on those found in vision, photosynthe¬ 
sis, and other biological functions. 
Onuchic believes an advantage to this 
line of study, as opposed to the tradi¬ 
tional study of inorganic materials, is 
that nature has already evolved spe¬ 
cial proteins to perform the tasks com¬ 
puter designers are trying to emulate. 

The goal of his research is the 
biochip, a molecular electron device 
that would include the protein or other 
organic chemical equivalents not only 
of wires but also of other computational 
gear, such as junctions, switches, re¬ 
sistors, and amplifiers. 

Towards that goal, Onuchic and his 
team are pursuing three lines of re¬ 
search. The first is biochemical, in 
which researchers study the features 
of natural proteins that control the rate 


of electron tunneling within and be¬ 
tween the amino acids that form the 
links of protein chains. The electron 
transfer rate depends on the nature of 
the chemical bonds traversed by the 
tunneling electrons as they jump from 
atom to atom and by the geometry of
the protein. Therefore, a second line 
of research explores the rules that gov¬ 
ern the folding of natural proteins into 
their effective shapes. 

The third line of research, which 
builds on the results of the other two 
lines, is directed toward making a work¬ 
ing molecular electronic device. These 
efforts aim at creating a one-molecule 
memory element called a shift regis¬ 
ter, along with a one-molecule ampli¬ 
fier to read the information stored in 
the shift register. Onuchic believes it 
might also be possible for proteins used 
as memory and computing elements 
to self-assemble on biochip surfaces, 
as proteins do in living cells.

Molecular manufacturing. Alter¬ 
natively, some researchers are work¬ 
ing towards molecular manufacturing, 
the ability to build nanomachines one 
molecule or atom at a time. In contrast 
to contemporary microcircuits, in which 
billions of electrons flow from one 
place to another, nanometer-scale cir¬ 
cuits might perform the same opera¬ 
tion with the change in shape or 
position of a single molecule. 


Micro bits 


Amador Corporation, a Minnesota- 
based, conformity assessment testing 
laboratory, has announced a cooperative
arrangement with the German
VDE Testing and Certification Insti¬ 
tute to test US information tech¬ 
nology equipment for export to 
Germany. 

IBM and the Center for Advanced 
Research in Biotechnology are de¬ 
veloping a portable software system 

for computational structural biol¬ 
ogy, including programs to compute 


protein structural data and model 
protein molecules. 

Symbmath, an expert system that
solves mathematical problems in symbolic
formula or through numeric
computation, is available as shareware
under the file name SM13A.ZIP
from the directory MSDOS.Calculator 
in Simtel 20 at File Transfer Protocol 
sites. Symbmath requires significantly 
less RAM than most comparable soft¬ 
ware—640 Kbytes, as opposed to as 
much as 4 Mbytes. 













To build such small devices, scien¬ 
tists must learn how to manipulate in¬ 
dividual molecules or atoms. Current 
methods of molecular manipulation 
rely on the random motions of atoms 
to form the desired compounds. But 
molecular nanotechnology aims at 
moving individual atoms or molecules 
and snapping them into place. 

Some progress has already been 
made. Scanning probe microscopes 
drag an ultra-fine stylus over a micro¬ 
scopic surface, thus mapping out its 
shape and allowing scientists to see in¬ 
dividual atoms. Researchers have found 
that they can sometimes get atoms to 
stick to the microscope tip, enabling 
them to move the atoms around. 

Using this technique, IBM research¬ 
ers in San Jose in 1989 positioned 35 
atoms of xenon to spell out their cor¬ 
porate logo. The achievement prompt¬ 
ed others to create their own nanoart: 
scientists at Hitachi’s Central Research 
Laboratory in Japan spelled out “Peace 
'91” by removing sulfur atoms from mo¬ 
lybdenum disulfide, and a Stanford 
University student inscribed the first 
page of A Tale of Two Cities on a sur¬ 
face about the size of a red blood cell. 

K. Eric Drexler, a visiting scholar at 
Stanford, is one of the founders of the 
Foresight Institute and the Institute of 
Molecular Manufacturing. He argues 
that manufacture at the molecular level 
is feasible and will one day produce 
nanomachines controlled by submicron,
1,000-MIPS CPUs and powered
by nanomotors. Among the uses of 
such devices could be swimming 
through blood vessels to find and de¬ 
stroy viruses or bonding chlorine at¬ 
oms from the atmosphere to protect 
the ozone layer. The machines would 
be manufactured by other nanoma¬ 
chines called molecular assemblers, tiny 
robot arms that position molecules precisely,
bonding them into place in the
design. (See related story on p. 7) 

Alternative methods rely on the ten¬ 
dency of certain molecular components 
to stick to one another in the design of 
structures, thus enabling self-assembling
or self-replicating devices. Nadrian
Seeman, a chemistry professor at New
York University, has created a design
that allowed DNA strands to assemble
themselves into a cubelike structure.
Julius Rebek, Jr., a chemist at Massachusetts
Institute of Technology, designed
a synthetic molecule that serves
as a template for two other molecules 
that combine to form a copy of the 
original. 

1.2-ps photodetectors 

A graduate student at the University 
of Michigan has used low-temperature- 
grown gallium arsenide to produce 
what she says may be the fastest light¬ 
detecting microchip. The device’s 1.2- 
ps response time is six times faster than 
commercial photodetectors. 

The chip’s developer, Yi Chen, cred¬ 
its the special type of gallium arsenide, 
developed by researchers at MIT, for 
its speed. The material responds very 
quickly to light, creating a current that 
stops and starts instantly as a laser pulse 
starts and stops. 



Chen's interdigitated electrodes 


Chen's biggest technical challenge
was developing an electrode structure
to match the sensitivity of the gallium
arsenide. She developed a way to use
submicron electron-beam lithography
to create interdigitated electrodes about
0.2 microns or 2,000 angstroms apart.
By condensing the photodetector's
electrodes into a smaller area, the design
prevents the loss of signal-carrying
electrons in the gaps between
electrodes. The detector achieves an
internal quantum efficiency of 68 percent
(collecting 68 out of every 100
electrons) as opposed to traditional
electrode structures that achieve about
1 percent.

According to Chen, the device holds 
promise for fiber-optic communication 
networks hundreds of times faster than 
current systems, precision 3D inspec¬ 
tion systems, vehicle collision avoid¬ 
ance systems, and advanced medical 
imaging techniques. Chen is a graduate
student in the school’s applied physics
program and a post-doctoral fellow
at AT&T Bell Laboratories. 

Cobol coinventor dies 

Rear Admiral Grace Murray Hopper, 
known as “the first lady of software,” 
died at her home in Arlington, Virginia, 
on January 4. She was 85. 

Hopper was a pioneer computer 
programmer for the US Navy and 
coinventor of Cobol. She is credited 
with coining the term bug to describe
the problems that plague computers 
and programs. 

Admirers described Hopper as a vigorous,
tireless, and occasionally contrary
woman, with a healthy contempt
for those unwilling to try new ideas. 
She once said, “The only phrase I've 
ever disliked is, 'We’ve always done it 
that way.’ ” 

Hopper earned a PhD in mathematics
from Yale University and taught at
Vassar College before joining the Naval
Reserve in 1943. After World War
II she remained in the Naval Reserve
and joined the company building the
Univac I. The company later merged
into Sperry, where she developed an
idea that led to Cobol. She was the
US’s oldest active duty military officer
when she retired in 1986.

Windows software

Speech from Windows 

Using synthesized speech, Monologue for
Windows reads aloud text, numbers, and data
from Windows 3.0 and its applications. As in the 
manufacturer’s original version, Monologue for 
Windows converts digital information into simu¬ 
lated speech in two phases using a set of pho¬ 
netic translation and pronunciation rules. The 
software analyzes and translates text into sound 
descriptors, a phonetic language containing more 
than 1,000 rules to incorporate pitch, duration, 
and amplitude codes. Then it converts the lan¬ 
guage into speech signals. Algorithms drive 
speech tables that incorporate rules for merging
signals into continuous speech. 

Users can adjust the speed, pitch, and volume 
on screen. A dictionary manager lets users in¬ 
struct the software how to pronounce words, 
abbreviations, acronyms, and symbols that may 
not comply with phonetic rules. The software 
requires 2 Mbytes of RAM, MS/PC-DOS 3.1 or 
higher, Windows 3.0 or higher, and a hard disk 
with at least 2 Mbytes available. First Byte; $149. 

Reader Service No. 10 

Video window 

X.TV is a video window that users can ma¬ 
nipulate like any other window on workstations 
running X windows. Users can reposition X.TV 
anywhere on the screen, scale it to full screen, 
or reduce it to icon size. An on-screen control 
panel accesses audio-video functions, including 
volume, brightness, and contrast. A software 
push-button activates the frame grabber. Images 
can be named and saved to a disk file in raw 
data, TIFF, or Targa formats.

X.TV receives images through VMEbus, SCSI, 
or RS-232 port buses from the manufacturer's 
RGB/View hardware. RGB/View processes images
independently from the CPU and receives
them from a variety of sources including video cameras,
scanning electron microscopes, and infrared
devices. RGB; $750 (X.TV), from $8,995
(RGB/View).


Reader Service No. 11 



RGB's X.TV video window 


VMEbus development kit 

Programmers can develop Microsoft Windows 
3.0 applications in a VMEbus environment using
the XVME-984 Windows VMEbus Toolkit.
The kit is a utilities package designed to run 
with the manufacturer’s line of PC/AT VMEbus 
processors and support modules. It includes 
VMEbus Manager, a Windows 3.0 application 
that accesses and monitors the VMEbus. A li¬ 
brary of functions contains the low-level code 
that configures and communicates with the 
manufacturer’s VME I/O products. Demonstration
programs are also included that offer examples
of library uses. Xycom; $650 (with
Windows), $500 (toolkit only).

Reader Service No. 12 

PC-mainframe software 

The Extra PC-to-mainframe software for Mi-

Texas Instruments says its
TMS320C40 is the first digital signal
processor built specifically for parallel 
processing. One of the biggest chal¬ 
lenges facing the chip’s design team 
was how to debug a chain or array of 
multiple processors. Chief architect 
and program manager Ray Simar had 
seen customers purchase individual 
debuggers for each processor in a 
parallel system.

“Those debuggers cost $10,000 to
$20,000 each,” Simar said. “This gets
prohibitive if you have 16 processors.”

Space becomes a big problem, too. 
"They’d have nowhere to put them,” 
he added, “so they'd have them hang¬ 
ing from the ceilings.” 

TI avoids rooms full of hanging 
debugger boxes by incorporating an
analysis module onto each C40 chip. 
The module links through a JTAG 
(IEEE Std. 1149.1) port to debugger 
software, accessible in a windows- 
format interface. Users can monitor 
and test processors in the system indi¬ 
vidually or globally. 

The analysis module is one of the 
key features of the C40, which Simar 
says is the first DSP built specifically 
for parallel processing. His team de¬ 
signed the chip from the ground up 
after looking at customers' difficulties 
incorporating their previous-generation
DSP, the C30, into parallel processing
systems.

Parallel-processing DSP

Texas Instruments' TMS320C40 DSP

According to Simar, customers
linked the C30s together with multichip
FIFO memories, an expensive process
that also takes up space. “What
we've done, essentially, is integrate 12
FIFOs onto the chip,” he said. The C40
has six communication ports that link directly
to other C40s with no external
logic required. Thus, they link together
tightly in pipelines, 2D arrays, or 3D arrays.
This close proximity helps the C40
achieve what TI says is the highest performance
of any floating-point processor:
275 MOPs with 320-Mbyte/s
throughput per C40. Simar says the chips
achieve even higher performance in parallel
processing systems.

An on-chip, six-channel DMA coprocessor
acts as a clerk for the on-chip
CPU. To allow the CPU to function
smoothly, the DMA receives data, assembles
it into packets, and stores it
in a memory buffer (if necessary) until
the CPU is ready for it. The C40 has
an 8-Kbyte memory and two external
memory buses to global and local 
memory. Through the interface, users 
choose how much on-chip memory 
space is allocated for the DMA and 
how much for the CPU. 

Among the tools available from the 
manufacturer are an in-circuit emula¬ 
tor that provides parallel debug capa¬ 
bilities for embedded applications, a 
parallel-processing development sys¬ 
tem, an ANSI-compatible C compiler 
with parallel-processing runtime sup¬ 
port library, an assembler and linker, 
and a state-accurate simulator. Third- 
party tools include the Multiprox 
code generation systems, an optimiz¬ 
ing Ada compiler, and the Spox real¬ 
time operating system. 

Simar says the C40 is suited for 
medical imaging (ultrasound, mag¬ 
netic resonance, and CAT scans), pat¬ 
tern recognition for robotics, radar 
processing, 3D graphics, and military 
uses (image recognition and track¬ 
ing). He also sees prospects for some 
nontraditional uses, including high- 
performance accelerators that attach 
directly to personal computers or 
workstations, high-bandwidth data 
transmission, and high-speed net¬ 
work interface. TI plans to ship in vol¬ 
ume by mid 1992. Texas Instruments. 

Reader Service No. 13 


crosoft Windows includes several easy- 
to-use features. Dynamic data exchange 
macros create automated information 
links between the mainframe and Win¬ 
dows applications. APL character sup¬ 
port assists in financial and statistical 
analysis. 

Extra also supports light pens, a fea¬ 
ture designed primarily for the health 
care industry. Command-line file trans¬ 
fer capability, for users more comfort¬ 
able with DOS syntax, allows 
concurrent transfer of files using mac¬ 
ros. A diagnostic trace facility allows 
users to record communications events 
on the network, to identify problems. 
Attachmate; $425, $75 (upgrades). 

Reader Service No. 14 


Cut and paste from X to MS 

An X server integrates the X Window 
System in a Microsoft Windows envi¬ 
ronment. Xoftware for Windows allows 
users to cut and paste between X and 
MS Windows applications and use MS 
Windows as a local window manager. 
It also features a complete on-line help 
system and several conveniences for 
installing, configuring, and starting X 
applications from within the MS Win¬ 
dows environment. 

Xoftware for PC Unix, release 2.1, is 
an X server for 386/486-based machines 
running SCO or Interactive Unix. It adds 
support for Texas Instruments’ 34020- 
based graphics accelerator board and a 
hot-key function that allows users to switch 
to SCO's Multiscreen application. Age; 
$495 (Xoftware for Windows), $595 
(Xoftware for PC Unix). 

Reader Service No. 15 

DOS-to-Unix connectivity 

Aterm software allows two-way file 
transfer between systems and emu¬ 
lates ANSI color terminals, with graphics sup¬ 
port, multiple screen display, and a 
menu-driven interface. It features a hot- 
key function that allows seamless 
movement between DOS and Unix sys¬ 
tems. PC users can run DOS applica¬ 
tions and switch immediately to 
applications running on the Unix host. 
Aterm is compatible with MS-DOS and 
Windows. Specialix; from $125. 

Reader Service No. 16 


Signal-processing 
hardware and software 

Background data collection 

Easy Data collects data, modifies it, 
and sends it to the keyboard buffer 
without affecting normal keyboard 
functions. It imports data in the back¬ 
ground while users work with another 
program in the foreground. Easy Data 
works with most software that receives 
data entry through the keyboard. Key¬ 
board characters and macros can be in¬ 
serted automatically before and after 
each data field to simulate the same 
keys that would be pressed in manual 
data entry. Data can also be selectively 
parsed so that only the required data 
transfers. Labtronics; $145. 

Reader Service No. 17 

1,280-Mflops performance 

Two of Texas Instruments’ C40 DSP 
chips (see box on p. 79 and diagram 
below) are put to use in the Spirit-40 
AT, an 80-Mflops DSP engine. Each 40- 
Mflops C40 includes 1 Mbyte of SRAM 
on the main bus, 1 Mbyte of SRAM on 
the local bus, 64 Kbytes of boot 
EPROM, and memory expansion for up 
to 16 Mbytes of DRAM. 

With the C40’s six communication 
links, the Spirit-40 configures up to a 
six-dimensional hypercube engine with 
64 nodes. The board occupies a 16-bit 
ISA slot and is compatible with 386- 
and 486-based machines. Up to eight 
boards fit in a passive backplane ISA 
system, yielding 640-Mflops perfor¬ 
mance. Two backplanes yield 1,280- 
Mflops peak performance. Sonitech 
International; $8,995. 

Reader Service No. 18 
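
The figures quoted for the Spirit-40 AT can be checked with a short sketch. It assumes that hypercube nodes carry binary labels and that two nodes are linked when their labels differ in exactly one bit, which gives each node six neighbors in six dimensions, one per C40 communication port. The labeling scheme and the function names are illustrative assumptions, not Sonitech's documented configuration.

```python
# Sketch only: the node labeling and helper names are assumptions made
# for illustration; they are not taken from Sonitech or TI documentation.

def hypercube_neighbors(node: int, dimensions: int) -> list[int]:
    """Return the nodes whose binary label differs from `node` in one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def peak_mflops(backplanes: int, boards_per_backplane: int = 8,
                chips_per_board: int = 2, mflops_per_chip: int = 40) -> int:
    """Aggregate peak rate from the per-chip figure quoted in the item."""
    return backplanes * boards_per_backplane * chips_per_board * mflops_per_chip


DIMENSIONS = 6  # one hypercube dimension per C40 communication port

print("nodes:", 2 ** DIMENSIONS)
print("neighbors of node 0:", hypercube_neighbors(0, DIMENSIONS))
print("one backplane:", peak_mflops(1), "Mflops")
print("two backplanes:", peak_mflops(2), "Mflops")
```

The peak figures are simple products of the per-chip rate; sustained performance on a real workload would depend on how well the application maps onto the links.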

DSP package 

Filter designers can analyze the per- 
formance of their designs on Monarch, 
a menu-driven DSP package with filter 
design, signal analysis, and graphical 
capabilities. Users can design finite im¬ 
pulse response filters (up to a maxi¬ 
mum order of 512) or infinite impulse 
response filters. 

Sonitech's Spirit-40 AT 

The package includes Siglab, a DSP 
language with over 100 mathematical 
and system operations for performing 
signal and system analysis. Designers 
can create their own algorithms and 
DSP operations, which can be synthe¬ 
sized, tested, and saved for future use 
in other applications. Graphical dis¬ 
plays feature overlay, zoom, grid, color, 
line style, and data location support. 
Monarch simulates real-world condi¬ 
tions, such as noise and fixed-point 
analysis. It requires MS-DOS 3.0 or 
higher, 640-Kbyte RAM, and a hard disk 
(or 2-Mbyte floppy). Dynacomp; 
$549.95, $10 (demonstration disk). 

Reader Service No. 19 


Communication 
hardware and software 

Controls 255 nodes 

COM20010 is a token-passing com¬ 
munications controller designed for 
high-speed intelligent data highways. 
The 2.5-Mbps device features a 
microcontroller interface and 1K x 8 on¬ 
board buffer RAM. It uses an Arcnet 
protocol engine and a token-passing 
protocol for real-time deterministic per¬ 
formance that supports data packet 
sizes from 4 to 512 bytes. 

Up to 255 peer-to-peer nodes can 
network via the COM20010 at a data 
rate of 2.5 Mbps. System designers can 
implement peer-to-peer or master-slave 
communication schemes. The CMOS- 
fabricated device runs on a +5V power 
supply and comes in commercial and 
industrial temperature ranges. Standard 
Microsystems; $11.36 (1,000s). 

Reader Service No. 20 

LANs connect to WANs 

Local area networks connect into 
wide area networks with the LAN 
Transport Management System. LAN 
TMS, suited for Token Ring and 
Ethernet environments, eases intercon¬ 
nection between diverse LAN proto¬ 
cols, such as IBM Net BIOS, SNA, TCP/ 
IP, Novell Netware, IBM server, and 
Banyan Vines. Synchronous data link 
control pass-through capability enables 
it to merge a non-LAN serial data stream 
with regular Token Ring traffic for trans¬ 
mission over a common WAN link. 
Network administrators can monitor 
and troubleshoot problems on geo¬ 
graphically dispersed LANs from a cen¬ 
tral management console. General 
Datacomm; from $4,600. 

Reader Service No. 21 

Gateway software 

Up to 30 Netware users can access 
transmission control protocol and inter¬ 
net protocol applications with Catipult 
gateway software. Its TCP/IP applica¬ 
tions run over IPX, using Novell’s Net 
BIOS. From the user's PC, Catipult’s 
applications are routed to the OS/2 
gateway, which replaces IPX with the 
TCP/IP transport protocols and sends 
the application packets to other hosts. 
The product runs on OS/2 PCs 
equipped with a network interface card 
supported by Novell's Netware Re¬ 
quester for OS/2 and over Ethernet, 
Token Ring, Arcnet, and Broadband 
networks running Netware 2.x or 
Netware 3.x. Ipswitch; $2,975. 

Reader Service No. 22 

Sparcstations link to IBM 

Two software products connect Sun 
Microsystems Sparcstations and IBM’s 
systems network architecture (SNA) over 
a Token Ring. Brx TR/SNA transpar¬ 
ently supports all SNA services and sup¬ 
ports local network management. Brx 
TR/IP leverages the transfer functions 
of the Token Ring network to include 
Unix systems. The technology enables 
traditional TCP/IP applications such as 
NFS, File Transfer, Mail, and Remote 
Login to operate seamlessly over To¬ 
ken Ring. Users can access remote 
Sparcstations as if they were locally 
connected to the same network. 
Brixton Systems; $995 (both products 
and Token Ring card). 

Reader Service No. 23 


Special boards 

Video interface card 

The MZ-UT01 interface card estab¬ 
lishes a video rate path between the 
Scorpion real-time VGA frame grabber 
and Eighteen Eight Laboratories’ 
PL2500 floating-point array processors. 
Multiple 640 x 480 x 8-bit images can 
be transferred in real time between the 
video card and the array processor. The 
half-length mezzanine card mounts 
directly on the PL2500 and draws 
power and ground through Span 32 
bus interface connectors. The MZ-UT01 
card comes with software to control 
the Scorpion board, interconnection 
cables, and a reference manual. 
Univision Technologies; $2,495. 

Reader Service No. 24 

16.7 million colors 

The Volante family of graphics pro¬ 
cessors offers three alternatives to PC 
users. The AT2000, compatible with PC 
ATs, displays up to 16.7 million colors 
at 640 x 480 resolution. At higher reso¬ 
lution (up to 1,024 x 768) it displays 
32,768 colors. The MC1000, compat¬ 
ible with Micro Channel PCs, displays 
256 colors at 1,024 x 768 resolution. 
The AT800, also compatible with PC 
ATs, displays 256 colors. 

The processors, based on Texas In¬ 
struments’ TMS34020, have a refresh 
rate of 72 Hz and a maximum band¬ 
width of 73 MHz. Their graphical in¬ 
terface works with MS Windows, Tiga, 
X Windows, and Nova Graphics CGI. 
National Design; $1,295 (AT2000), 
$995 (MC1000), $795 (AT800). 

Reader Service No. 25 
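
The color counts quoted for the Volante boards follow from the number of bits stored per pixel, and the same arithmetic gives a rough idea of display memory per frame. The sketch below is illustrative only: the helper names are hypothetical, and padding each pixel to a whole number of bytes (so that a 15-bit pixel occupies 2 bytes) is an assumption rather than a National Design specification.

```python
# Illustrative arithmetic only; the helper names and the padding of
# 15-bit pixels to 2 bytes are assumptions, not vendor specifications.

def bits_per_pixel(colors: int) -> int:
    """Smallest number of bits that can encode `colors` distinct values."""
    return (colors - 1).bit_length()


def frame_bytes(width: int, height: int, colors: int) -> int:
    """Bytes for one frame, with each pixel padded to a whole byte."""
    bytes_per_pixel = (bits_per_pixel(colors) + 7) // 8
    return width * height * bytes_per_pixel


modes = [
    ("AT2000, 640 x 480", 640, 480, 2 ** 24),   # the "16.7 million" colors
    ("AT2000, 1,024 x 768", 1024, 768, 32768),
    ("MC1000, 1,024 x 768", 1024, 768, 256),
]

for name, width, height, colors in modes:
    print(name, bits_per_pixel(colors), "bits/pixel,",
          frame_bytes(width, height, colors), "bytes/frame")
```

Under these assumptions the higher resolution at 32,768 colors needs more memory per frame than the lower resolution at 16.7 million colors, which is consistent with the color count dropping as resolution rises.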



National Design's AT2000, MC1000, 
AT800 


Flicker-free screens 

The Windows VGA 8800 graphics 
board boosts processing speeds for 
Windows 3.0 menus, windows, scroll- 
continued on p. 83 

New Products 


Mips has extended its RISC archi¬ 
tecture to the 64-bit format with the 
R4000 microprocessor, a 1.3-million- 
transistor chip designed to handle the 
growing volume of data found not 
only in large processors but in desk¬ 
top machines as well. 

According to Andy Keane, prod¬ 
uct development manager for the 
R4000, company designers looked in 
the direction RISC architecture was 
moving and decided to target the 
volume desktop market. "The early 
focus (of RISC) was on large ma¬ 
chines, especially for databases,” 
Keane said. "Now, RISC is more 
broadly defined.” The R4000 is aimed 
at a range of applications, including 
high-end PCs, graphics workstations, 
database servers, and multiprocess¬ 
ing systems. 

As users grapple with increasingly 
large amounts of data, the current 
generation of 32-bit machines, which 
handles up to 4 Gbytes of informa¬ 
tion, is becoming obsolete. 

64-bit RISC microprocessor 

Based on observations that the 
memory requirements of the average 
program grow by a factor of 1.5 to 2 
each year, the company expects main¬ 
stream applications to exceed the capa¬ 
bility of 32-bit machines by the 
mid-nineties. Doubling the number of 
bits in the microprocessor enables it to 
handle up to a terabyte of data. 
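
The address-space argument can be made concrete with a small sketch. It shows that 32 address bits reach 4 Gbytes and estimates how many years of growth at 1.5x or 2x per year take a program from a given size to that ceiling. The 16-Mbyte starting size is a hypothetical value chosen only for illustration, and the 40-bit width used to arrive at a terabyte is our assumption; the article gives only the terabyte figure.

```python
# Sketch under stated assumptions: the 16-Mbyte starting footprint and
# the 40-bit address width are hypothetical values chosen for illustration.

def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a byte address of the given width."""
    return 2 ** address_bits


def years_to_reach(start_bytes: int, limit_bytes: int, growth: float) -> int:
    """Whole years of compound growth until the size reaches the limit."""
    years = 0
    size = float(start_bytes)
    while size < limit_bytes:
        size *= growth
        years += 1
    return years


GBYTE = 2 ** 30
limit_32 = addressable_bytes(32)

print("32-bit limit:", limit_32 // GBYTE, "Gbytes")
print("40-bit limit:", addressable_bytes(40) // 2 ** 40, "Tbyte")

START = 16 * 2 ** 20  # hypothetical program footprint
for growth in (1.5, 2.0):
    print("growth", growth, "->", years_to_reach(START, limit_32, growth), "years")
```

The result is sensitive to the starting size, so the sketch illustrates the shape of the argument rather than confirming the mid-nineties estimate.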

Keane said his company is targeting 
PC users who are interested in the per¬ 
formance of a workstation but don't 
want to give up their investment in soft¬ 
ware. The R4000 is compatible with 
R3000 and R6000 applications, as well as 
Windows NT, Santa Cruz Operation's 
Desktop, and RISC/os, Mips' implemen¬ 
tation of Unix. The chip performs in a 
32-bit subset mode for most applica¬ 
tions, employing its larger address only 
when necessary. 

A 50-MHz clock drives the chip and 
operates a 100-MHz superpipeline. On- 
chip CPU components include a 64-bit 
integer processor, 64-bit floating-point 
coprocessor, memory management unit, 
8-Kbyte instruction cache, 8-Kbyte data 
cache, control and management facili¬ 
ties for primary and secondary cache, 
and multiprocessing capabilities. 

The R4000 comes in three versions. 

• R4000PC, which supports primary 
on-chip cache, is a 179-pin grid ar¬ 
ray package aimed at desktop, low- 
end servers, and embedded control 
systems. 

• R4000SC, with secondary cache for 
uniprocessing applications, is 
offered in 447-PGA or land grid ar¬ 
ray packages intended for high- 
performance desktops and servers. 

• R4000MC, with multiprocessing 
features and secondary cache, also 
comes in a 447-PGA or -LGA 
package. 

Based on simulations of the SPEC 
benchmark suite, performance ranges 
from 40 Specmarks for the R4000PC 
to 60 Specmarks for the other two 
versions. According to the company, 
comparable Specmark performance 
has previously been achieved only 
from implementations of five to nine 
chips. 

Among the development tools 
available are a C RISC compiler and 
a systems programmer package. The 
latter includes a cache memory simu¬ 
lator, an architecture simulator, and 
a development package. 

Through the Advanced Computing 
Environment Initiative, a consortium 
of software, systems, and semicon¬ 
ductor companies that promotes an 
open computing environment, Mips 
has licensed six semiconductor 
manufacturers to produce the R4000. 
Last year Olivetti, Acer, and Mips pre¬ 
viewed PC-size machines using the 
R4000. Mips expects machines using 
its microprocessor to be commercially 
available in 1992. Mips; 
$700 to $1,200, through licensed 
manufacturers. 

Reader Service No. 26 

ing, and fonts up to 30 times faster than 
Super VGA standards, according to the 
manufacturer. A refresh rate of 70 Hz 
reduces screen flicker. Other features 
include 1 Mbyte of memory, 256 col¬ 
ors, and resolution up to 1,280 x 968 
pixels. Genoa Systems; $495. 

Reader Service No. 27 

SBus-to-VMEbus link 

Sun Sparcstations can interface with 
a VMEbus system with the Model 467 
adapter. The adapter includes an SBus 
card and a 6U VME card, connected 
with a shielded cable. The systems 
communicate in two ways. Memory 
mapping permits the SBus and VMEbus 
host processor in one chassis to ex¬ 
ecute random access reads and writes 
to the destination bus as it would to a 
local memory. Alternatively, a built-in 
DMA controller permits transfers from 
the memory of one system to the other 
as fast as 20 Mbytes/s. Bit 3 Computer 
Corp.; $2,850. 

Reader Service No. 28 



Bit 3's Model 467 adapter 


Storage devices 

Solid-state disk emulator 

The Blue Flame III solid-state, non¬ 
volatile disk emulator boosts access 
speed by up to 20 times. The DOS- 
compatible card is an I/O mapped de¬ 
vice built from 14 SIMMs. It features a 
16-bit data path and transfers data up 
to 4 Mbytes/s. Capacities range from 2 
to 56 Mbytes. Each card fits in a full- 
length, 16-bit ISA bus slot. Semi Disk 
Systems; from $595. 

Reader Service No. 29 

20-Mbyte Mac floppy 

Quad Flextra is a very high density 
floppy-disk subsystem for the Macin¬ 
tosh with a 35-ms seek time and a data 
transfer rate of 1.25 Mbytes/s. Each 3.5- 
in. disk in the subsystem has a format¬ 
ted capacity of 20.4 Mbytes, enough to 
hold font libraries, large graphics, and 
desktop publishing files. The external 
floppy subsystem measures 2.25 x 6.81 
x 8.38 in., weighs less than 4 lbs., and 
has two SCSI connectors. The system 
includes a shielded cable, driver, util¬ 
ity software, and four disks. Quadram; 
$895, $25 (additional disks). 

Reader Service No. 30 

4.3-Gbyte tape drive 

Two quarter-inch cartridge tape 
drives store up to 2.15 Gbytes of 
uncompressed data or 4.3 Gbytes of 
compressed data. The 9200 and the 
9200C use a track density of 30 ser¬ 
pentine recording tracks and a record 
density of 67,733 bpi. 

The drives support a synchronous 
burst transfer rate of 4.8 Mbytes/s and 
a sustained host transfer rate of 400 to 
600 Kbytes/s for the 9200 and from 800 
to 1,200 Kbytes/s for the 9200C. Ac¬ 
cording to the manufacturer, each de¬ 
vice accesses any file on a tape in two 
minutes or less and specifies a non- 
recoverable error rate of no more than 
1 in 10'’ bits. Wangtek; $900 (9200), 
$1,100 (9200C) (OEM quantities). 

Reader Service No. 31 
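
As a rough check on the tape-drive figures, the sustained host transfer rates imply how long it takes to fill a cartridge. The sketch below assumes decimal units (1 Gbyte = 10^9 bytes, 1 Kbyte = 10^3 bytes) and pairs the 4.3-Gbyte compressed capacity with the 9200C's higher rate; both choices are our assumptions, since the item does not spell them out. The function name is hypothetical.

```python
# Rough arithmetic only. Decimal units and the pairing of compressed
# capacity with the 9200C's rates are assumptions, not Wangtek figures.

GBYTE = 1e9
KBYTE = 1e3


def transfer_minutes(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Minutes needed to move `capacity_bytes` at a sustained rate."""
    return capacity_bytes / rate_bytes_per_s / 60


cases = [
    ("9200, 2.15 Gbytes", 2.15 * GBYTE, (400 * KBYTE, 600 * KBYTE)),
    ("9200C, 4.3 Gbytes", 4.3 * GBYTE, (800 * KBYTE, 1200 * KBYTE)),
]

for name, capacity, (slow, fast) in cases:
    print(name, "takes about",
          round(transfer_minutes(capacity, fast)), "to",
          round(transfer_minutes(capacity, slow)), "minutes")
```

Under these assumptions both drives take about the same time to fill a tape, since the 9200C's rate and capacity are each double those of the 9200.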

Winchesters for portables 

Portable, laptop, and notebook com¬ 
puters can support more complex ap¬ 
plications with the higher storage 
capacities of two Winchester disk 
drives. The MK-2024FC has a format¬ 
ted capacity of 86 Mbytes and an aver¬ 
age access time of 19 ms. The 
MK-2124FC holds 130 Mbytes and sup¬ 
ports 17-ms access. Both drives come 
in a 0.75 x 2.5-in., 6-oz. package and 
consume 1.8 watts when active and 
0.15 W when inactive. Each drive has 
a 32-Kbyte cache memory and a 5- 
Mbyte/s data transfer rate. Toshiba; 
$495 (MK-2024FC, OEM quanti¬ 
ties), $695 (MK-2124FC). 

Reader Service No. 32 



Toshiba's MK-2024FC 


Laptop upgrades 

Laptop Solutions upgrades hard disks 
for Toshiba laptops and notebooks, to 
store up to 120 Mbytes of data. Each 
upgrade includes installation and par¬ 
titioning of the hard drive to user speci¬ 
fications, formatting of each logical 
drive, complete system diagnostics, and 
a one-year parts and labor warranty. 
The Houston-based company guaran¬ 
tees a 48-hour turnaround, including a 
24-hour burn-in and test. Laptop Solu¬ 
tions; from $695. 

Reader Service No. 33 

Indicate your interest in this department 
by circling the appropriate number on 

Product Summary 

Joe Hootman 

University of North Dakota 


Chips 

Manufacturer: Cirrus Logic 
Model: CL-GD6411 graphics controller 
Comments: A 3.3V, one-chip VGA controller supports prototyping and system 
development for notebook computers with 64 gray shades on a 
monochrome LCD. The device can directly drive a 512-color, active- 
matrix LCD. Features include simultaneous LCD and CRT displays, 
host bus interface logic, RAMDAC, and memory control logic. 160- 
pin quad flat pack; $65 (100s). 
R.S.#: 80 

Manufacturer: Cybernetic Micro Systems 
Model: CY545B/CY500 controllers 
Comments: The CMOS CY545B uses pulse and direction signals to generate up 
to 27,000 steps/s for full-, half-, quad-, and microstep applications. 
The CY500 provides programmable acceleration slopes for applications 
requiring 1 step/minute to 2,000 steps/s. Both accept ASCII 
or binary commands from a serial or parallel host. 44-pin PLCC or 
40-pin DIP from $25 (1,000s) (CY545); 40-pin LSI from $10 (1,000s) 
(CY500). 
R.S.#: 81 

Manufacturer: Linear Technology 
Model: LTC1235 supervisor 
Comments: Microprocessor supervisory circuit resets at 1V and offers a 
conditional battery backup feature for RAM data. Available in commercial 
or industrial grades and 16-pin SO or plastic DIP. $3.85 (100s) 
(plastic, commercial). 
R.S.#: 82 

Manufacturer: LSI Logic 
Model: Sparkit-40/SS2 chip set 
Comments: Manufacturers developing workstations compatible with the Sun 
Sparcstation 2 can sample the 40-MHz chip set, which includes the 
L64841 MMU, L64844 cache controller, and the L64846 DRAM 
controller. Manufactured for Sun and sold under license, the set 
works with three graphics controllers and SunSoft software. $844 
(100s). 
R.S.#: 83 

Manufacturer: Philips Semiconductors 
Model: 83C528 controller 
Comments: This 80C51-compatible microcontroller features a watchdog timer, 
32 Kbytes of ROM, and 512 bytes of on-chip RAM. The extra 
RAM allows the CMOS device to run compiled application programs 
in PL/M and C languages and provides space for context 
switching for stack enhancements in internal memory. DIPs, 
PLCCs, or quad flat packs; EPROM and one-time-programmable 
versions available. 
R.S.#: 84 

Manufacturer: Pletronics 
Model: Clock oscillators 
Comments: Line of 50- to 120-MHz clock oscillators produces HCMOS/ACMOS- 
compatible output and drives 15 standard TTL loads. The crystal 
operates in fundamental mode at all frequencies. DIPs and mini 
DIPs with plastic/J and Gull-Wing leads; from $3 (1,000s). 
R.S.#: 85 

Manufacturer: Xicor 
Model: X24C00 EEPROM 
Comments: A 128-bit, CMOS serial device interfaces directly to a 2-wire serial 
bus and features software protocol that allows operation at a 1-MHz 
clock rate. 8-pin plastic DIP, type P, and plastic SOIC, type S; from 
45 cents (1,000s). 
R.S.#: 86 

Boards 

Manufacturer: Data Translation 
Model: DT385 image processors 
Comments: One-monitor, PC AT/EISA-compatible boards combine a frame 
grabber and graphics processor to acquire, digitize, and display 
standard and nonstandard video signals on a VGA monitor. An 
on-board TMS34020 processor accelerates Windows 3.0 graphics 
display; permits scaling and variable screen placement; and 
speeds arithmetic, convolutions, and morphological operations. 
From $2,995, depending on memory. 
R.S.#: 87 

Manufacturer: Micro Express 
Model: Hi-Color Turbo VGA card 
Comments: Designed around the Tseng Labs graphic chip set and the Sierra 
Semiconductor 15-bit DAC, the 32,768-color card supports 800 x 
600 and 640 x 480 resolutions, as well as 1,024 x 768 resolution 
with 256 colors. A Turbo switch permits the card to switch 
between zero- and one-wait-state operation. $185. 
R.S.#: 88 

Manufacturer: Newer Technology 
Model: fx/Overdrive accelerator 
Comments: Variable-speed (40-/55-MHz) accelerator, when running at its 
fastest setting, promises to speed stock Macintosh IIfx performance 
approximately 40 percent. When teamed with 16-Mbyte 
SIMMs, the surface mount device lets the IIfx act as a workstation. 
The accelerator’s motherboard installation leaves the Nubus 
and PDS slots open. 
R.S.#: 89 

Manufacturer: Parsytec 
Model: BBK-S4 Sbus/transputer interface 
Comments: Adapter board and accompanying software let Sparcstations and 
compatibles serve as a standard host to large-scale transputer 
systems for data transfers up to 8.8 Mbytes/s. The T225 and 
controller-equipped Sbus slave also assists image processing, real-time, 
and other computation-intensive applications and links with 
up to four external systems. $3,950; quantity pricing available (10s). 
R.S.#: 90 

Software 

Manufacturer: Microware Systems 
Model: Polytron source code control 
Comments: Utility package for automatic version control of application 
source code supports OS-9 and OS-9000 real-time operating 
systems. The set of 10 integrated modules supports multiuser 
development environments and features built-in file and user- 
access security. From $895; available on disk or tape. 
R.S.#: 91 

Manufacturer: MIPS Computer Systems 
Model: R3000 Riscross tools 
Comments: RISC microprocessor software development tools run on Sun-4/ 
Sun OS workstations and VAX/VMS systems. The cross-development 
tools include a K&R-compliant C compiler, System 
Programmer’s Package, Cache 3000 memory simulator, and SPP/e 
development package. $40,000 (set); from $9,000 (individual 
prices). 
R.S.#: 92 

Manufacturer: Slate Corporation 
Model: Penbook 
Comments: Electronic book Author and Reader software lets users translate, 
compress, and store Postscript documents and read them on a 
pen-based computer. Documents displayed as mixed text/ 
graphics pages can be searched for user-defined words or 
phrases. A markup layer lets users annotate documents with 
personal notes. $695 (Author); $99 (Reader). 
R.S.#: 93 

Peripherals 

Manufacturer: HDS 
Model: Viewstation FX terminals 
Comments: X Window, RISC-based, 14-, 16-, and 19-inch, 256-color terminals 
feature an Intel i960 CPU and two ASIC processors for communications 
and graphics integrated onto one board. Local Open Look and 
Motif window managers reduce network traffic and improve 
interactive performance. From $2,799. 
R.S.#: 94 

Manufacturer: Mitsubishi 
Model: P-78U video printer 
Comments: Monochrome autoscanning video printer promises to deliver 6 x 8- 
inch, A4-size prints with 1,280 horizontal dots per line in 256 shades 
of gray in 24 seconds. Composite, S-VHS, analog, TTL RGB, and 
Centronics parallel port input print in positive/negative, mirror- 
image, and multiformat print styles. $2,999. 
R.S.#: 95 

Manufacturer: RGB Spectrum 
Model: 1600U scan converter 
Comments: Video scan converter with zoom, antialiasing, and 24-bit color 
processing changes high-resolution computer graphics to television 
format in real time. The unit automatically synchronizes to computer 
displays with 20/90-kHz horizontal scan rates. An optional RS-232 
port lets users control all functions from a computer or ASCII 
terminal. 
R.S.#: 96 

Miscellaneous 

Manufacturer: Aristo Computers 
Model: Simcheck adapters 
Comments: Memory tester family adds four adapters (ZIP memory chips; PLCC 
and SOJ memory chips; bank; and Apple Macintosh IIfx SIMM). 
Basic Simcheck tests standard SIMM and SIP memory modules with 
8 or 9 bits of 64K to 16 Mbytes. A two-line LCD shows instructions 
and test results. $99 to $345 (adapters), $995 (Simcheck). 
R.S.#: 97 

Manufacturer: Lucas Duralith 
Model: LDC100 controller 
Comments: Controller with internal EEPROM, 8-bit A/D converter, and serial 
interface lets designers test the applicability of touch-screen technology 
to their products by touching the screen at two opposite 
corners of the active display area. The controller is part of a development 
kit that includes a touch screen, two product development 
software disks, a user's guide, 220V to 110V transformer and plugs, 
and appropriate cables and connectors. $695. 
R.S.#: 98 

Manufacturer: Philips Semiconductors 
Model: KM110BH/1x sensors 
Comments: Hybrid magnetoresistive modules measure rotational speed down 
to zero using a toothed tachometer wheel with an inductive or Hall- 
effect sensor. Separations between tooth and sensor can be 
several millimeters, and tooth structure does not have to be defined 
closely. Samples available. 
R.S.#: 99 

Manufacturer: Solectek 
Model: Pocket fax modem 
Comments: Portable, 9,600/4,800-bps send/receive modem supports DOS, 
Windows, and Macintosh applications. Users send faxes from within 
applications via a pop-up fax menu; they may continue working in 
the foreground or leave the application by entering a telephone 
number and normal printing commands. From $299.95, depending 
on application. 
R.S.#: 100 

Software Report 


continued from p. 9 

Micromachines will make it possible 
to create new medical equipment ca¬ 
pable of performing diagnosis and 
treatment simply, accurately, and with 
less need for surgery. They will also 
contribute to the advancement of 
microsurgery techniques (eye surgery, 
suture of microscopic blood vessels, 
and so on) and the development of arti¬ 
ficial organs to be placed inside the 
body. Biotechnologists will apply 
micromachines to cellular manipulation 
such as cell separation, injections into 
cells, and cell fusion. 

Announcements 

NSF sponsors an active program for 
US scientists interested in working in 
Japan. As part of that program NSF pre¬ 
pared a directory of 150 Japanese com¬ 
panies that are willing to receive 
American researchers at their laborato¬ 
ries. The directory lists the company 
name, activity information, personnel, 
facilities, and research they will sup¬ 
port, along with contact information. 
For copies of the directory, write to Ja¬ 
pan Programs, Division of International 
Programs, National Science Founda¬ 
tion, Washington, DC 20550; fax (202) 
357-5839. 

The English summary of the outcome 
of the preliminary study on NIPT (or 
Sixth Generation Project) that I reported 
on in the April issue has been released. 
The Industrial Electronics Division of 
MITI published the 150-page “Report of 
the Research Committee on New Infor¬ 
mation Processing Technology” in 
March 1991. 

The Information Flood 

Trying to manage the flood of information that passes 
before you can be a frustrating experience. Potentially use¬ 
ful information can be lost when you lack the means to or¬ 
ganize and make sense of the multiple sources that arrive 
daily—as often as not, unbidden. 

IEEE Micro's 

On the Edge... 

... offers a solution to this continuing problem. 

In August, On the Edge begins a two-part tools discussion by James D. 
Gafford. The series will illustrate fairly simple ways for you to make use of 
sophisticated information management tools. The commercially available PC 
tools (MS-DOS or Macintosh) combine ease of use with information manage¬ 
ment power and flexibility. A common theme running through the series will 
be the creation and maintenance of a tool you can use to keep track of the 
information you read in IEEE Micro and other technical publications. 

LOOK FOR THE AUGUST ISSUE 
of IEEE Micro 

It will help you manage the information flood while 
gaining a better grasp of software tools and soft¬ 
ware issues in general. 


February 1992 87 












Advertiser/Product Index 


FOR DISPLAY ADVERTISING INFORMATION, CONTACT: 


Western Region: D. Rodney Brooks; Tel: (415) 905-0260; Fax: (415) 
896-1512. 

Eastern Region: Georgette Boone; Tel: (415) 905-0260; Fax: (415) 
896-1512. 

Recruitment and Classified Advertising: D. Rodney Brooks; Tel: 
(415) 905-0260; Fax: (415) 896-1512. 

Director of Sales: Randall L. Stickrod, 544 Second St., Suite 200, San 
Francisco, CA 94107; Tel: (415) 905-0260; Fax: (415) 896-1512. 


For production information, conference, and classified advertising, 
contact Heidi Rex or Marian Tibayan. 

IEEE MICRO, 10662 Los Vaqueros Cir., PO Box 3014, Los Alamitos, 
CA 90720-1264; phone (714) 821-8380; fax (714) 821-4010. 


Coming 

Next Issue 

The April issue of IEEE Micro features articles se¬ 
lected from presentations at the third annual Hot 
Society’s Technical Committee on Microprocessors 

Don’t miss 

Hot Chips III 

Read the April 1992 Issue of 

The LiSBUS™ Async I/O System 


A solution of the 1990’s for today’s data transmission problems. 



Outstandingly Simple and Reliable because LiSBUS™ is 
based on a breakthrough technology which uses the impedance 
of the bus cable to replace binary addresses. Consequently, 
data transmission management is greatly simplified 
and much more reliable than today's equivalent systems 
which require expensive software, hardware, and 
personnel investments. 

Outstandingly Practical because it is easy to install and 
easy to operate. No special tools, workbench, or electronics 
expertise are needed. Anyone can be up and running in 
minutes. Just plug in the external modules and configure 
the system with the user-friendly LiSBUS™ Link Control 
Software. Each external module measures only around 
2 in. by 2 in. 

Outstandingly Flexible because a user can connect up to 
60 peripherals or computers to a controlling computer 
through their RS-232C (COM) ports. To add peripherals, 
just extend the bus cable and add modules. 

At an Unbeatable Price because at $650* for the LiSBUS™ 
Starter Pack, no alternative can offer all these advantages 
combined into one product without spending a much higher 
amount. The Starter Pack includes all the user needs to 
connect four peripherals and a complete set of LiSBUS™ 
Development Tools to create custom applications. 

LiSBUS™: 

A product of our new CommNexus™ line of communication systems. 

Special Pre-release Offer: you can receive 
the LiSBUS™ Starter Pack at the special pre- 
release price of $550. Just call or mail in your 
order by February 29, 1992, at the latest. Deliveries 
will begin early March. 

For more information and ordering contact: 

In the USA and Canada: 

Toll Free (800) 945-3002 

Mon.-Fri. 9am - 9pm EST 

GIGATEC (USA), Inc. 

871 Islington Street 
Portsmouth, NH 03801 USA 
Tel. (603) 433-2227 
Fax (603) 433-5552 

* Specifications and prices subject to change without prior notification. 
Visa and MasterCard/EuroCard accepted. CommNexus™ and LiSBUS™ are trademarks of GIGATEC SA. 